POLYMORPHISMS IN THE GLUCOSE-6
PHOSPHATE DEHYDROGENASE LOCUS

CROSS-REFERENCE TO RELATED
APPLICATION

The present application claims priority from provisional 
application 60/029,374, filed Oct. 28, 1996, which is incor- 
porated by reference in its entirety for all purposes. 

BACKGROUND OF THE INVENTION 

The genomes of all organisms undergo spontaneous muta- 
tion in the course of their continuing evolution generating 
variant forms of progenitor sequences (Gusella, Ann. Rev.
Biochem. 55, 831-854 (1986)). The variant form may confer 
an evolutionary advantage or disadvantage relative to a 
progenitor form or may be neutral. In some instances, a 
variant form confers a lethal disadvantage and is not trans- 
mitted to subsequent generations of the organism. In other 
instances, a variant form confers an evolutionary advantage 
to the species and is eventually incorporated into the DNA 
of many or most members of the species and effectively 
becomes the progenitor form. In many instances, both 
progenitor and variant form(s) survive and co-exist in a
species population. The coexistence of multiple forms of a 
sequence gives rise to polymorphisms. 

Several different types of polymorphism have been 
reported. A restriction fragment length polymorphism 
(RFLP) means a variation in DNA sequence that alters the 
length of a restriction fragment as described in Botstein et 
al., Am. J. Hum. Genet. 32, 314-331 (1980). The restriction
fragment length polymorphism may create or delete a
restriction site, thus changing the length of the restriction
fragment. RFLPs have been widely used in human and
animal genetic analyses (see WO 90/13668; WO 90/11369;
Donis-Keller, Cell 51, 319-337 (1987); Lander et al., Genetics
121, 85-99 (1989)). When a heritable trait can be linked
to a particular RFLP, the presence of the RFLP in an 
individual can be used to predict the likelihood that the 
animal will also exhibit the trait. 
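
As an illustration of how a single base change can create or delete a restriction site and so alter fragment lengths, the following is a minimal Python sketch. The two 30-base alleles, the choice of the EcoRI recognition site (GAATTC, cut after the first base) and the function name `fragment_lengths` are assumptions made for the example and do not come from the sequences of this application.

```python
def fragment_lengths(seq: str, site: str, cut_offset: int) -> list:
    """Return the lengths of the fragments produced by cutting `seq`
    at every occurrence of the recognition sequence `site`.

    `cut_offset` is the number of bases into the site at which the cut falls.
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Hypothetical alleles that differ at a single base inside the site.
LEFT, RIGHT = "ACGTACGTAC", "GGCTAGCTAGGATC"
reference = LEFT + "GAATTC" + RIGHT   # site present
variant = LEFT + "GAATCC" + RIGHT     # site destroyed by a T-to-C change

print(fragment_lengths(reference, "GAATTC", 1))  # two fragments
print(fragment_lengths(variant, "GAATTC", 1))    # one uncut fragment
```

In this sketch the reference allele yields two fragments and the variant allele a single uncut fragment, which is the length difference an RFLP assay would detect.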

Other polymorphisms take the form of short tandem 
repeats (STRs) that include tandem di-, tri- and tetranucle- 
otide repeated motifs. These tandem repeats are also referred 
to as variable number tandem repeat (VNTR) polymor- 
phisms. VNTRs have been used in identity and paternity 
analysis (U.S. Pat. No. 5,075,217; Armour et al., FEBS Lett.
307, 113-115 (1992); Horn et al., WO 91/14003; Jeffreys,
EP 370,719), and in a large number of genetic mapping 
studies. 

Other polymorphisms take the form of single nucleotide 
variations between individuals of the same species. Such 
polymorphisms are far more frequent than RFLPs, STRs
and VNTRs. Some single nucleotide polymorphisms occur 
in protein-coding sequences, in which case, one of the 
polymorphic forms may give rise to the expression of a 
defective or other variant protein and, potentially, a genetic 
disease. Examples of genes in which polymorphisms within
coding sequences give rise to genetic disease include
β-globin (sickle cell anemia) and CFTR (cystic fibrosis).
Other single nucleotide polymorphisms occur in noncoding 
regions. Some of these polymorphisms may also result in 
defective protein expression (e.g., as a result of defective 
splicing). Other single nucleotide polymorphisms have no 
phenotypic effects. 

Single nucleotide polymorphisms can be used in the same
manner as RFLPs and VNTRs but offer several advantages.
Single nucleotide polymorphisms occur with greater frequency
and are spaced more uniformly throughout the genome than
other forms of polymorphism. The greater frequency and
uniformity of single nucleotide polymorphisms mean that
there is a greater probability that such a polymorphism will
be found in close proximity to a genetic locus of interest than
would be the case for other polymorphisms. Also, the
different forms of characterized single nucleotide
polymorphisms are often easier to distinguish than other
types of polymorphism (e.g., by use of assays employing
allele-specific hybridization probes or primers).

Despite the increased amount of nucleotide sequence data
being generated in recent years, only a minute proportion of
the total repository of polymorphisms in humans and other
organisms has so far been identified. The paucity of
polymorphisms hitherto identified is due to the large amount
of work required for their detection by conventional methods.
For example, a conventional approach to identifying
polymorphisms might be to sequence the same stretch of
oligonucleotides in a population of individuals by dideoxy
sequencing. In this type of approach, the amount of work
increases in proportion to both the length of sequence and
the number of individuals in a population and becomes
impractical for large stretches of DNA or large numbers of
persons.

SUMMARY OF THE INVENTION 

The invention provides nucleic acid segments of between
10 and 100 bases containing at least 10, 15 or 20 contiguous
nucleotides from any of the sequences shown in any of
TABLE 2 (SEQ ID NOS:1 and 2), TABLE 3 (SEQ ID
NOS:3 and 4), TABLE 4 (SEQ ID NOS:5 and 6), TABLE 5
(SEQ ID NOS:7-15), TABLE 6 (SEQ ID NOS:16 and 17),
TABLE 7 (SEQ ID NOS:18 and 19), TABLE 8 (SEQ ID
NOS:20 and 21), TABLE 9 (SEQ ID NOS:22-24), TABLE
10 (SEQ ID NOS:25-27) and TABLE 11 (SEQ ID NOS:28
and 29) including a polymorphic site. Complements of these
segments are also included. The segments can be DNA or
RNA, and can be double- or single-stranded. Some segments
are 10-20 or 10-50 bases long. Preferred segments include
a diallelic polymorphic site.

The invention further provides allele-specific
oligonucleotides that hybridize to a sequence shown in TABLE 2
(SEQ ID NOS:1 and 2), TABLE 3 (SEQ ID NOS:3 and 4),
TABLE 4 (SEQ ID NOS:5 and 6), TABLE 5 (SEQ ID
NOS:7-15), TABLE 6 (SEQ ID NOS:16 and 17), TABLE 7
(SEQ ID NOS:18 and 19), TABLE 8 (SEQ ID NOS:20 and
21), TABLE 9 (SEQ ID NOS:22-24), TABLE 10 (SEQ ID
NOS:25-27) and TABLE 11 (SEQ ID NOS:28 and 29) or its
complement. These oligonucleotides can be probes or
primers.

The invention further provides a method of analyzing a
nucleic acid from an individual. The method determines
which base is present at any one of the polymorphic sites
shown in TABLE 2 (SEQ ID NOS:1 and 2), TABLE 3 (SEQ
ID NOS:3 and 4), TABLE 4 (SEQ ID NOS:5 and 6), TABLE
5 (SEQ ID NOS:7-15), TABLE 6 (SEQ ID NOS:16 and 17),
TABLE 7 (SEQ ID NOS:18 and 19), TABLE 8 (SEQ ID
NOS:20 and 21), TABLE 9 (SEQ ID NOS:22-24), TABLE
10 (SEQ ID NOS:25-27) and TABLE 11 (SEQ ID NOS:28
and 29). Optionally, a set of bases occupying a set of the
polymorphic sites shown in TABLE 2 (SEQ ID NOS:1 and
2), TABLE 3 (SEQ ID NOS:3 and 4), TABLE 4 (SEQ ID
NOS:5 and 6), TABLE 5 (SEQ ID NOS:7-15), TABLE 6
(SEQ ID NOS:16 and 17), TABLE 7 (SEQ ID NOS:18 and
19), TABLE 8 (SEQ ID NOS:20 and 21), TABLE 9 (SEQ ID
NOS:22-24), TABLE 10 (SEQ ID NOS:25-27) and TABLE
11 (SEQ ID NOS:28 and 29) is determined. This type of
analysis can be performed on a plurality of individuals who
are tested for the presence of a disease phenotype. The
presence or absence of disease phenotype can then be
correlated with a base or set of bases present at the
polymorphic sites in the individuals tested.

DEFINITIONS 

An oligonucleotide can be DNA or RNA, and single- or
double-stranded. Oligonucleotides can be naturally occurring
or synthetic, but are typically prepared by synthetic
means. Preferred oligonucleotides of the invention include
segments of DNA, or their complements, including any one
of the polymorphic sites shown in TABLE 2 (SEQ ID
NOS:1 and 2), TABLE 3 (SEQ ID NOS:3 and 4), TABLE 4
(SEQ ID NOS:5 and 6), TABLE 5 (SEQ ID NOS:7-15),
TABLE 6 (SEQ ID NOS:16 and 17), TABLE 7 (SEQ ID
NOS:18 and 19), TABLE 8 (SEQ ID NOS:20 and 21),
TABLE 9 (SEQ ID NOS:22-24), TABLE 10 (SEQ ID
NOS:25-27) and TABLE 11 (SEQ ID NOS:28 and 29). The
segments are usually between 5 and 100 bases, and often
between 5-10, 5-20, 10-20, 10-50, 20-50 or 20-100 bases.
The polymorphic site can occur within any position of the
segment. The segments can be from any of the allelic forms
of DNA shown in TABLE 2 (SEQ ID NOS:1 and 2), TABLE
3 (SEQ ID NOS:3 and 4), TABLE 4 (SEQ ID NOS:5 and 6),
TABLE 5 (SEQ ID NOS:7-15), TABLE 6 (SEQ ID NOS:16
and 17), TABLE 7 (SEQ ID NOS:18 and 19), TABLE 8
(SEQ ID NOS:20 and 21), TABLE 9 (SEQ ID NOS:22-24),
TABLE 10 (SEQ ID NOS:25-27) and TABLE 11 (SEQ ID
NOS:28 and 29).

Hybridization probes are oligonucleotides capable of 
binding in a base-specific manner to a complementary strand
of nucleic acid. Such probes include peptide nucleic acids, 
as described in Nielsen et al., Science 254, 1497-1500 
(1991). 

The term primer refers to a single-stranded oligonucleotide
capable of acting as a point of initiation of template-directed
DNA synthesis under appropriate conditions (i.e., in
the presence of four different nucleoside triphosphates and
an agent for polymerization, such as DNA or RNA
polymerase or reverse transcriptase) in an appropriate buffer and
at a suitable temperature. The appropriate length of a primer
depends on the intended use of the primer but typically
ranges from 15 to 30 nucleotides. Short primer molecules
generally require cooler temperatures to form sufficiently
stable hybrid complexes with the template. A primer need
not reflect the exact sequence of the template but must be
sufficiently complementary to hybridize with a template.
The term primer site refers to the area of the target DNA to
which a primer hybridizes. The term primer pair means a set
of primers including a 5' upstream primer that hybridizes
with the 5' end of the DNA sequence to be amplified and a
3' downstream primer that hybridizes with the complement
of the 3' end of the sequence to be amplified.

Linkage describes the tendency of genes, alleles, loci or
genetic markers to be inherited together as a result of their 
location on the same chromosome, and can be measured by 
percent recombination between the two genes, alleles, loci 
or genetic markers. 

Polymorphism refers to the occurrence of two or more
genetically determined alternative sequences or alleles in a
population. A polymorphic marker or site is the locus at
which divergence occurs. Preferred markers have at least
two alleles, each occurring at a frequency of greater than 1%,
and more preferably greater than 10% or 20% of a selected
population. A polymorphic locus may be as small as one
base pair. Polymorphic markers include restriction fragment
length polymorphisms, variable number of tandem repeats
(VNTRs), hypervariable regions, minisatellites, dinucleotide
repeats, trinucleotide repeats, tetranucleotide repeats,
simple sequence repeats, and insertion elements such as Alu.
The first identified allelic form is arbitrarily designated as
the reference form and other allelic forms are designated as
alternative or variant alleles. The allelic form occurring most
frequently in a selected population is sometimes referred to
as the wildtype form. Diploid organisms may be homozygous
or heterozygous for allelic forms. A diallelic polymorphism
has two forms. A triallelic polymorphism has three
forms.

A single nucleotide polymorphism occurs at a polymorphic
site occupied by a single nucleotide, which is the site
of variation between allelic sequences. The site is usually
preceded by and followed by highly conserved sequences of
the allele (e.g., sequences that vary in less than 1/100 or
1/1000 members of the populations).

A single nucleotide polymorphism usually arises due to
substitution of one nucleotide for another at the polymorphic
site. A transition is the replacement of one purine by another
purine or one pyrimidine by another pyrimidine. A
transversion is the replacement of a purine by a pyrimidine or
vice versa. Single nucleotide polymorphisms can also arise
from a deletion of a nucleotide or an insertion of a nucleotide
relative to a reference allele.
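
The transition and transversion definitions above can be sketched in a few lines of Python. This is a minimal illustration only; the function name `classify_substitution` is an assumption made for the example, and insertions and deletions are deliberately not handled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("reference and variant bases are identical")
    # Purine<->purine or pyrimidine<->pyrimidine is a transition;
    # any change of class is a transversion.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))  # purine to purine
print(classify_substitution("A", "C"))  # purine to pyrimidine
```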

Hybridizations are usually performed under stringent
conditions, for example, at a salt concentration of no more
than 1M and a temperature of at least 25° C. For example,
conditions of 5X SSPE (750 mM NaCl, 50 mM
NaPhosphate, 5 mM EDTA, pH 7.4) and a temperature of
25°-30° C. are suitable for allele-specific probe
hybridizations.

An isolated nucleic acid means an object species
that is the predominant species present (i.e., on a
molar basis it is more abundant than any other individual
species in the composition). Preferably, an isolated nucleic
acid comprises at least about 50, 80 or 90 percent (on a
molar basis) of all macromolecular species present. Most
preferably, the object species is purified to essential
homogeneity (contaminant species cannot be detected in the
composition by conventional detection methods).

DESCRIPTION OF THE PRESENT INVENTION 

I. Novel Polymorphisms of the Invention 
The human glucose-6-phosphate dehydrogenase locus
(G6PD) encompasses more than 50,000 bp and resides on
the X chromosome. A complete prototypical sequence of the
G6PD locus has been published. That locus has remained
relatively unexplored due to the cost and difficulty of
conventional sequence analysis. The published sequence shows
that the G6PD locus contains at least two genes, the G6PD
gene and the 2-19 gene. Those genes span approximately
16,000 bp and 10,000 bp, respectively. The enzyme G6PD
plays a fundamental role in glucose metabolism. The function
of the 2-19 polypeptide product, however, has not been
shown.

The present application provides 10 polymorphisms at 10
sequence tagged sites in the human G6PD locus. Table 1
shows the base occupied at those ten sites in 10 individuals.
The sequences flanking each of these polymorphisms are
shown in TABLE 2 (SEQ ID NOS:1 and 2), TABLE 3 (SEQ
ID NOS:3 and 4), TABLE 4 (SEQ ID NOS:5 and 6), TABLE
5 (SEQ ID NOS:7-15), TABLE 6 (SEQ ID NOS:16 and 17),
TABLE 7 (SEQ ID NOS:18 and 19), TABLE 8 (SEQ ID
NOS:20 and 21), TABLE 9 (SEQ ID NOS:22-24), TABLE
10 (SEQ ID NOS:25-27) and TABLE 11 (SEQ ID NOS:28
and 29). The polymorphic site is flanked by bold lines in the
table. The sequences designated M1-M10 represent novel
allelic variants of this site. The designation N as it appears
in Tables 1-11 means the identity of a base was not
determined.

II. Analysis of Polymorphisms 

A. Preparation of Samples 

Polymorphisms are detected in a target nucleic acid from 
an individual being analyzed. For assay of genomic DNA, 
virtually any biological sample (other than pure red blood 
cells) is suitable. For example, convenient tissue samples 
include whole blood, semen, saliva, tears, urine, fecal 
material, sweat, buccal, skin and hair. For assay of cDNA or 
mRNA, the tissue sample must be obtained from an organ in 
which the target nucleic acid is expressed. 

Many of the methods described below require amplification
of DNA from target samples. This can be accomplished
by e.g., PCR. See generally PCR Technology: Principles and
Applications for DNA Amplification (ed. H. A. Erlich, Freeman
Press, N.Y., N.Y., 1992); PCR Protocols: A Guide to
Methods and Applications (eds. Innis, et al., Academic
Press, San Diego, Calif., 1990); Mattila et al., Nucleic Acids
Res. 19, 4967 (1991); Eckert et al., PCR Methods and
Applications 1, 17 (1991); PCR (eds. McPherson et al., IRL
Press, Oxford); and U.S. Pat. No. 4,683,202 (each of which
is incorporated by reference for all purposes).

Other suitable amplification methods include the ligase
chain reaction (LCR) (see Wu and Wallace, Genomics 4, 560
(1989), Landegren et al., Science 241, 1077 (1988)),
transcription amplification (Kwoh et al., Proc. Natl. Acad. Sci.
USA 86, 1173 (1989)), and self-sustained sequence
replication (Guatelli et al., Proc. Nat. Acad. Sci. USA, 87, 1874
(1990)) and nucleic acid based sequence amplification
(NASBA). The latter two amplification methods involve
isothermal reactions based on isothermal transcription,
which produce both single stranded RNA (ssRNA) and
double stranded DNA (dsDNA) as the amplification
products in a ratio of about 30 or 100 to 1, respectively.

B. Detection of Polymorphisms in Target DNA 
There are two distinct types of analysis depending on
whether a polymorphism in question has already been
characterized. The first type of analysis is sometimes
referred to as de novo characterization. This analysis
compares target sequences in different individuals to identify
points of variation, i.e., polymorphic sites. By analyzing
groups of individuals representing the greatest ethnic
diversity among humans and greatest breed and species variety in
plants and animals, patterns characteristic of the most
common alleles/haplotypes of the locus can be identified, and the
frequencies of such alleles/haplotypes in the population
determined. Additional allelic frequencies can be determined for
subpopulations characterized by criteria such as geography,
race, or gender. The de novo identification of the
polymorphisms of the invention is described in the Examples section.
The second type of analysis is determining which form(s) of
a characterized polymorphism are present in individuals
under test. There are a variety of suitable procedures, which
are discussed in turn.
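
As an illustration of the de novo comparison described above, the following minimal Python sketch scans pre-aligned sequences from several individuals and reports the positions at which more than one base is observed. The sample names and sequences are invented for the example, the function name `find_polymorphic_sites` is an assumption, and the sketch treats N as an undetermined base, in line with the convention stated for the Tables. It is not the array-based procedure of the Examples section.

```python
def find_polymorphic_sites(aligned: dict) -> dict:
    """Map each variable column (0-based) to the set of bases observed there.

    `aligned` maps an individual's identifier to a sequence; all sequences
    must already be aligned to the same length. Undetermined bases (N)
    are ignored when deciding whether a column varies.
    """
    seqs = list(aligned.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    sites = {}
    for i in range(length):
        bases = {s[i].upper() for s in seqs} - {"N"}
        if len(bases) > 1:
            sites[i] = bases
    return sites


# Hypothetical aligned reads from three individuals.
samples = {
    "ind1": "ACGTTAGC",
    "ind2": "ACGCTAGC",
    "ind3": "ACGTTNGC",
}
print(find_polymorphic_sites(samples))
```

Here only column 3 is reported, because the N in the third individual does not count as an alternative base at column 5.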
1. Allele-Specific Probes 

The design and use of allele-specific probes for analyzing
polymorphisms is described by e.g., Saiki et al., Nature 324,
163-166 (1986); Dattagupta, EP 235,726; Saiki, WO
89/11548. Allele-specific probes can be designed that
hybridize to a segment of target DNA from one individual
but do not hybridize to the corresponding segment from
another individual due to the presence of different
polymorphic forms in the respective segments from the two
individuals. Hybridization conditions should be sufficiently
stringent that there is a significant difference in hybridization
intensity between alleles, and preferably an essentially
binary response, whereby a probe hybridizes to only one of
the alleles. Some probes are designed to hybridize to a
segment of target DNA such that the polymorphic site aligns
with a central position (e.g., in a 15 mer at the 7 position; in
a 16 mer, at either the 8 or 9 position) of the probe. This
design of probe achieves good discrimination in
hybridization between different allelic forms.

Allele-specific probes are often used in pairs, one member
of a pair showing a perfect match to a reference form of a
target sequence and the other member showing a perfect
match to a variant form. Several pairs of probes can then be
immobilized on the same support for simultaneous analysis
of multiple polymorphisms within the same target sequence.
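
The pairing of a reference-matched and a variant-matched probe with the polymorphic site at a central position can be sketched as follows. This is a minimal illustration under stated assumptions: the target sequence is invented, the function names are hypothetical, the site index is 0-based, and "central" is taken to mean `length // 2` bases from the start of the window. Real probe design would also weigh melting temperature and secondary structure, which this sketch ignores.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def probe_pair(target: str, site: int, alt_base: str, length: int = 15):
    """Return (reference_probe, variant_probe), each complementary to a
    window of `target` with the polymorphic site at the central position."""
    left = length // 2                 # bases of the window preceding the site
    start = site - left
    end = start + length
    if start < 0 or end > len(target):
        raise ValueError("site is too close to an end of the target")
    if alt_base == target[site]:
        raise ValueError("variant base equals the reference base")
    ref_window = target[start:end]
    var_window = ref_window[:left] + alt_base + ref_window[left + 1:]
    return reverse_complement(ref_window), reverse_complement(var_window)


# Hypothetical 21-base target with a T/C polymorphism at index 10.
target = "GATTACAGGCTTACCGATGCA"
ref_probe, var_probe = probe_pair(target, site=10, alt_base="C")
print(ref_probe)
print(var_probe)
```

The two probes differ at exactly one base, located in the middle of the 15 mer, so that each is a perfect match to one allelic form and carries a central mismatch to the other.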

2. Tiling Arrays

The polymorphisms can also be identified by hybridization
to nucleic acid arrays, some examples of which are
described by WO 95/11995 (incorporated by reference in its
entirety for all purposes). One form of such arrays is
described in the Examples section in connection with de
novo identification of polymorphisms. The same array or a
different array can be used for analysis of characterized
polymorphisms. WO 95/11995 also describes subarrays that
are optimized for detection of variant forms of a
precharacterized polymorphism. Such a subarray contains probes
designed to be complementary to a second reference
sequence, which is an allelic variant of the first reference
sequence. The second group of probes is designed by the
same principles as described in the Examples except that the
probes exhibit complementarity to the second reference
sequence. The inclusion of a second group (or further
groups) can be particularly useful for analyzing short
subsequences of the primary reference sequence in which multiple
mutations are expected to occur within a short distance
commensurate with the length of the probes (i.e., two or
more mutations within 9 to 21 bases).
3. Allele-Specific Primers 

An allele-specific primer hybridizes to a site on target
DNA overlapping a polymorphism and only primes
amplification of an allelic form to which the primer exhibits
perfect complementarity. See Gibbs, Nucleic Acid Res. 17,
2427-2448 (1989). This primer is used in conjunction with
a second primer which hybridizes at a distal site.
Amplification proceeds from the two primers leading to a detectable
product signifying the particular allelic form is present. A
control is usually performed with a second pair of primers,
one of which shows a single base mismatch at the
polymorphic site and the other of which exhibits perfect
complementarity to a distal site. The single-base mismatch prevents
amplification and no detectable product is formed. The
method works best when the mismatch is included in the
3'-most position of the oligonucleotide aligned with the
polymorphism because this position is most destabilizing to
elongation from the primer. See, e.g., WO 93/22456.
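
The placement of the polymorphic base at the 3'-most position of the primer can be sketched as follows. This is a minimal illustration with assumptions: the target sequence and the function name `allele_specific_primers` are hypothetical, the site index is 0-based, and the primers are written in the sense of the given strand, so they would anneal to its complement. Choice of the distal second primer is not shown.

```python
def allele_specific_primers(target: str, site: int, alleles, length: int = 20) -> dict:
    """Return one primer per allele, each ending (3'-most base) on the
    polymorphic site and sharing the same upstream sequence."""
    start = site - length + 1
    if start < 0 or site >= len(target):
        raise ValueError("site does not leave room for a primer of this length")
    stem = target[start:site]          # bases shared by all primers
    return {allele: stem + allele for allele in alleles}


# Hypothetical target with a T/C polymorphism at index 10; short primers
# are used here only to keep the example readable.
target = "GATTACAGGCTTACCGATGCA"
primers = allele_specific_primers(target, site=10, alleles=("T", "C"), length=8)
for allele, primer in primers.items():
    print(allele, primer)
```

Each primer matches one allelic form perfectly and mismatches the other only at its final base, which is the arrangement the text describes as most destabilizing to elongation.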
4. Direct-Sequencing

The direct analysis of the sequence of polymorphisms of
the present invention can be accomplished using either the
dideoxy chain termination method or the Maxam Gilbert
method (see Sambrook et al., Molecular Cloning, A
Laboratory Manual (2nd Ed., CSHP, New York 1989); Zyskind
et al., Recombinant DNA Laboratory Manual, (Acad. Press,
1988)).
5. Denaturing Gradient Gel Electrophoresis

Amplification products generated using the polymerase
chain reaction can be analyzed by the use of denaturing
gradient gel electrophoresis. Different alleles can be
identified based on the different sequence-dependent melting
properties and electrophoretic migration of DNA in solution.
Erlich, ed., PCR Technology, Principles and Applications
for DNA Amplification, (W.H. Freeman and Co, New York,
1992), Chapter 7.

6. Single-Strand Conformation Polymorphism Analysis

Alleles of target sequences can be differentiated using
single-strand conformation polymorphism analysis, which
identifies base differences by alteration in electrophoretic
migration of single stranded PCR products, as described in
Orita et al., Proc. Nat. Acad. Sci. 86, 2766-2770 (1989).
Amplified PCR products can be generated as described
above, and heated or otherwise denatured, to form single
stranded amplification products. Single-stranded nucleic
acids may refold or form secondary structures which are
partially dependent on the base sequence. The different
electrophoretic mobilities of single-stranded amplification
products can be related to base-sequence difference between
alleles of target sequences.
III. Methods of Use 

After determining polymorphic form(s) present in an 
individual at one or more polymorphic sites, this informa- 
tion can be used in a number of methods. 



A. Forensics

Determination of which polymorphic forms occupy a set
of polymorphic sites in an individual identifies a set of
polymorphic forms that distinguishes the individual. See
generally National Research Council, The Evaluation of
Forensic DNA Evidence (Eds. Pollard et al., National Academy
Press, DC, 1996). Since the polymorphic sites are
within a 50,000 bp region in the human genome, the
probability of recombination between these polymorphic
sites is low. That low probability means the haplotype (the
set of all 10 polymorphic sites) set forth in this application
should be inherited without change for at least several
generations. The more sites that are analyzed the lower the
probability that the set of polymorphic forms in one
individual is the same as that in an unrelated individual.
Preferably, if multiple sites are analyzed, the sites are
unlinked. Thus, polymorphisms of the invention are often
used in conjunction with polymorphisms in distal genes.
Preferred polymorphisms for use in forensics are diallelic
because the population frequencies of two polymorphic
forms can usually be determined with greater accuracy than
those of multiple polymorphic forms at multi-allelic loci.

The capacity to identify a distinguishing or unique set of
forensic markers in an individual is useful for forensic
analysis. For example, one can determine whether a blood
sample from a suspect matches a blood or other tissue
sample from a crime scene by determining whether the set
of polymorphic forms occupying selected polymorphic sites
is the same in the suspect and the sample. If the set of
polymorphic markers does not match between a suspect and
a sample, it can be concluded (barring experimental error)
that the suspect was not the source of the sample. If the set
of markers does match, one can conclude that the DNA from
the suspect is consistent with that found at the crime scene.
If frequencies of the polymorphic forms at the loci tested
have been determined (e.g., by analysis of a suitable
population of individuals), one can perform a statistical analysis
to determine the probability that a match of suspect and
crime scene sample would occur by chance.

p(ID) is the probability that two random individuals have
the same polymorphic or allelic form at a given polymorphic
site. In diallelic loci, four genotypes are possible: AA, AB,
BA, and BB. If alleles A and B occur in a haploid genome
of the organism with frequencies x and y, the probability of
each genotype in a diploid organism is (see WO 95/12607):

Homozygote: $p(AA) = x^2$

Homozygote: $p(BB) = y^2 = (1-x)^2$

Single Heterozygote: $p(AB) = p(BA) = xy = x(1-x)$

Both Heterozygotes: $p(AB+BA) = 2xy = 2x(1-x)$

The probability of identity at one locus (i.e., the probability
that two individuals, picked at random from a population,
will have identical polymorphic forms at a given locus) is
given by the equation:

$p(ID) = (x^2)^2 + (2xy)^2 + (y^2)^2$



These calculations can be extended for any number of
polymorphic forms at a given locus. For example, the
probability of identity p(ID) for a 3-allele system where the
alleles have the frequencies in the population of x, y and z,
respectively, is equal to the sum of the squares of the
genotype frequencies:

$p(ID) = x^4 + (2xy)^2 + (2yz)^2 + (2xz)^2 + z^4 + y^4$

In a locus of n alleles, the appropriate binomial expansion 
is used to calculate p(ID) and p(exc). 
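
The extension to a locus of n alleles can be sketched in Python by summing the squares of all genotype frequencies, which is how the 3-allele case is described above. This is a minimal illustration; the function name `probability_of_identity` and the example frequencies are assumptions, and genotype frequencies are taken to be the products of allele frequencies as in the diallelic case.

```python
from itertools import combinations


def probability_of_identity(freqs) -> float:
    """p(ID) at one locus: the sum of the squared genotype frequencies.

    `freqs` holds the population frequencies of the alleles and must sum to 1.
    """
    freqs = list(freqs)
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    # Homozygous genotypes have frequency p*p; heterozygous genotypes,
    # counting both orders together, have frequency 2*p*q.
    homozygotes = sum((p * p) ** 2 for p in freqs)
    heterozygotes = sum((2 * p * q) ** 2 for p, q in combinations(freqs, 2))
    return homozygotes + heterozygotes


print(probability_of_identity([0.5, 0.5]))        # diallelic locus
print(probability_of_identity([0.5, 0.3, 0.2]))   # 3-allele locus
```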

The cumulative probability of identity (cum p(ID)) for
each of multiple unlinked loci is determined by multiplying
the probabilities provided by each locus.

$\mathrm{cum}\ p(ID) = p(ID_1)\,p(ID_2)\,p(ID_3) \cdots p(ID_n)$

The cumulative probability of non-identity for n loci (i.e.,
the probability that two random individuals will be different
at 1 or more loci) is given by the equation:

$\mathrm{cum}\ p(nonID) = 1 - \mathrm{cum}\ p(ID)$

If several polymorphic loci are tested, the cumulative
probability of non-identity for random individuals becomes
very high (e.g., one billion to one). Such probabilities can be
taken into account together with other evidence in
determining the guilt or innocence of the suspect.
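
The cumulative calculation for unlinked loci can be sketched as a simple product. This is a minimal illustration; the function names and the per-locus values are assumptions chosen for the example, not data from this application.

```python
from math import prod


def cumulative_identity(per_locus_pid) -> float:
    """cum p(ID): product of the per-locus probabilities of identity."""
    return prod(per_locus_pid)


def cumulative_non_identity(per_locus_pid) -> float:
    """cum p(nonID) = 1 - cum p(ID)."""
    return 1.0 - cumulative_identity(per_locus_pid)


# Ten hypothetical unlinked diallelic loci, each with p(ID) = 0.375.
loci = [0.375] * 10
print(cumulative_identity(loci))
print(cumulative_non_identity(loci))
```

The product shrinks quickly as loci are added, which is why the text prefers unlinked sites: linked sites would not contribute independent factors.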

B. Paternity Testing 

The object of paternity testing is usually to determine 
whether a male is the father of a child. In most cases, the 
mother of the child is known and thus, the mother's contri- 
bution to the child's genotype can be traced. Paternity 
testing investigates whether the part of the child's genotype 
not attributable to the mother is consistent with that of the 
putative father. Paternity testing can be performed by ana- 
lyzing sets of polymorphisms in the putative father and the 
child. 

If the set of polymorphisms in the child attributable to the 
father does not match the putative father, it can be 
concluded, barring experimental error, that the putative 
father is not the real father. If the set of polymorphisms in 
the child attributable to the father does match the set of 
polymorphisms of the putative father, a statistical calcula- 
tion can be performed to determine the probability of 
coincidental match. 
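
The statistical calculation mentioned above rests on the per-locus exclusion probability that this section gives for a diallelic site, p(exc) = xy(1 - xy), and on its cumulative form over n loci. The following minimal Python sketch evaluates both. The function names and the example frequencies are assumptions made for illustration, and the loci are assumed to be independent.

```python
from math import prod


def exclusion_probability(x: float, y: float) -> float:
    """p(exc) = xy(1 - xy) for a diallelic site with allele frequencies x and y."""
    if abs(x + y - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return x * y * (1.0 - x * y)


def cumulative_exclusion(per_locus_pexc) -> float:
    """cum p(exc) = 1 - product over loci of p(non-exc), where
    p(non-exc) = 1 - p(exc) at each locus."""
    return 1.0 - prod(1.0 - p for p in per_locus_pexc)


single = exclusion_probability(0.5, 0.5)
print(single)
print(cumulative_exclusion([single] * 10))  # ten hypothetical loci
```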
The probability of parentage exclusion (representing the
probability that a random male will have a polymorphic
form at a given polymorphic site that makes him
incompatible as the father) is given by the equation (see WO
95/12607):

$p(exc) = xy(1-xy)$

where x and y are the population frequencies of alleles A and
B of a diallelic polymorphic site.

(At a triallelic site $p(exc) = xy(1-xy) + yz(1-yz) + xz(1-xz) + 3xyz(1-xyz)$,
where x, y and z are the respective population
frequencies of alleles A, B and C.)

The probability of non-exclusion is

$p(non\text{-}exc) = 1 - p(exc)$

The cumulative probability of non-exclusion
(representing the value obtained when n loci are used) is
thus:

$\mathrm{cum}\ p(non\text{-}exc) = p(non\text{-}exc_1)\,p(non\text{-}exc_2)\,p(non\text{-}exc_3) \cdots p(non\text{-}exc_n)$

The cumulative probability of exclusion for n loci
(representing the probability that a random male will be
excluded) is:

$\mathrm{cum}\ p(exc) = 1 - \mathrm{cum}\ p(non\text{-}exc)$

If several polymorphic loci are included in the analysis,
the cumulative probability of exclusion of a random male is
very high. This probability can be taken into account in
assessing the liability of a putative father whose
polymorphic marker set matches the child's polymorphic marker set
attributable to his/her father.

C. Correlation of Polymorphisms with Phenotypic Traits

The polymorphisms of the invention may contribute to the
phenotype of an organism in different ways. Some
polymorphisms occur within a protein coding sequence and
contribute to phenotype by affecting protein structure. The effect
may be neutral, beneficial or detrimental, or both beneficial
and detrimental, depending on the circumstances. For
example, a heterozygous sickle cell mutation confers
resistance to malaria, but a homozygous sickle cell mutation is
usually lethal. Other polymorphisms occur in noncoding
regions but may exert phenotypic effects indirectly via
influence on replication, transcription, and translation. A
single polymorphism may affect more than one phenotypic
trait. Likewise, a single phenotypic trait may be affected by
polymorphisms in different genes. Further, some
polymorphisms predispose an individual to a distinct mutation that is

pancreas, prostate, skin, stomach and uterus. Phenotypic
traits also include characteristics such as longevity,
appearance (e.g., baldness, obesity), strength, speed, endurance,
fertility, and susceptibility or receptivity to particular drugs
or therapeutic treatments.

Correlation is performed for a population of individuals
who have been tested for the presence or absence of a
phenotypic trait of interest and for polymorphic marker
sets. To perform such analysis, the presence or absence of a
set of polymorphisms (i.e., a polymorphic set) is determined
for a set of the individuals, some of whom exhibit a
particular trait, and some of whom exhibit lack of the trait.
The alleles of each polymorphism of the set are then
reviewed to determine whether the presence or absence of a
particular allele is associated with the trait of interest.
Correlation can be performed by standard statistical methods
such as a chi-squared test and statistically significant
correlations between polymorphic form(s) and phenotypic
characteristics are noted. For example, it might be found that the
presence of allele A1 at polymorphism A correlates with
heart disease. As a further example, it might be found that
the combined presence of allele A1 at polymorphism A and
allele B1 at polymorphism B correlates with increased milk
production of a farm animal.

Such correlations can be exploited in several ways. In the
case of a strong correlation between a set of one or more
polymorphic forms and a disease for which treatment is
available, detection of the polymorphic form set in a human
or animal patient may justify immediate administration of
treatment, or at least the institution of regular monitoring of
the patient. Detection of a polymorphic form correlated with
serious disease in a couple contemplating a family may also
be valuable to the couple in their reproductive decisions. For
example, the female partner might elect to undergo in vitro
fertilization to avoid the possibility of transmitting such a
polymorphism from her husband to her offspring. In the case
of a weaker, but still statistically significant correlation
between a polymorphic set and human disease, immediate
therapeutic intervention or monitoring may not be justified.
Nevertheless, the patient can be motivated to begin simple
life-style changes (e.g., diet, exercise) that can be
accomplished at little cost to the patient but confer potential
benefits in reducing the risk of conditions to which the
patient may have increased susceptibility by virtue of variant
alleles. Identification of a polymorphic set in a patient
correlated with enhanced receptiveness to one of several
treatment regimes for a disease indicates that this treatment
regime should be followed.

For animals and plants, correlations between
characteristics and phenotype are useful for breeding for desired
characteristics. For example, Beitz et al., U.S. Pat. No.
5,292,639 discuss use of bovine mitochondrial
polymorphisms in a breeding program to improve milk production in
cows. To evaluate the effect of mtDNA D-loop sequence
causally related to a certain phenotype. 55 polymorphism on milk production, each cow was assigned 

Phenotypic traits include diseases that have known but a value of 1 if variant or 0 if wildtype with respect to a 
hitherto unmapped genetic components. Phenotypic traits prototypical mitochondrial DNA sequence at each of 17 
also include symptoms of, or susceptibility to, multifactorial locations considered. Each production trait was analyzed 
diseases of which a component is or may be genetic, such as individually with the following animal model: 
autoimmune diseases, inflammation, cancer, diseases of the 60 
nervous system, and infection by pathogenic microorgan- 
isms. Some examples of autoimmune diseases include rheu- >'tfi^-?i+l^/+'V+^*+Pi+ • • • ^i^P^.^n^p 
matoid arthritis, multiple sclerosis, diabetes (insulin- where ^tjknp ^ milk, fat, fat percentage, SNF, SNF 
dependent and non-independent), systemic lupus percentage, energy concentration, or lactation energy record; 
erythcmaosus and Graves disease. Some examples of can- 65 /* is an overall mean; YS, is the effect common to all cows 
cers include cancers of the bladder, brain, breast, colon, calving in year-season; Xj^ is the effect common to cows in 
esophagus, kidney, leukemia, liver, lung, oral cavity, ovary, either the high or average selection line; Pj to p^, are the 
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binomial regressions of production record on mtDNA 
D-loop sequence polymorphisnas; PE„ is permanent envi- 
ronmental effect common to all records of cow n; a„ is effect 
of animal n and is composed of the additive genetic contri- 
bution of sire and dam breeding values and a Mendelian 
sampling effect; and e^ is a random residual. It was found 
that eleven of seventeen polymorphisms tested influenced at 
least one production trait. Bovines having the best polymor- 
phic forms for milk production at these eleven loci are used 
as parents for breeding the next generation of the herd. 
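
As an illustration of the kind of standard statistical test mentioned in this section, the following minimal sketch computes a Pearson chi-squared statistic for a 2x2 table of allele presence against trait presence. The function name and the counts are hypothetical and are not taken from the specification; no continuity correction is applied, and the p-value uses the closed form that holds for one degree of freedom.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]].

    Rows: allele present / allele absent (hypothetical layout).
    Columns: trait present / trait absent.
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts: 30 carriers with the trait, 20 carriers without,
# 10 non-carriers with the trait, 40 non-carriers without.
statistic, p_value = chi_squared_2x2(30, 20, 10, 40)
```

A small p-value would suggest that the allele and the trait are not independent in the sample; it does not by itself establish a causal contribution of the polymorphism.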

D. Genetic Mapping of Phenotypic Traits 

The previous section concerns identifying correlations 
between phenotypic traits and polymorphisms that directly 
or indirectly contribute to those traits. The present section 
describes identification of a physical linkage between a 
genetic locus associated with a trait of interest and poly- 
morphic markers that are not associated with the trait, but 
are in physical proximity with the genetic locus responsible 
for the trait and co-segregate with it. Such analysis is useful 
for mapping a genetic locus associated with a phenotypic 
trait to a chromosomal position, and thereby cloning gene(s) 
responsible for the trait. See Lander et al., Proc. Natl. Acad.
Sci. (USA) 83, 7353-7357 (1986); Lander et al., Proc. Natl.
Acad. Sci. (USA) 84, 2363-2367 (1987); Donis-Keller et al.,
Cell 51, 319-337 (1987); Lander et al., Genetics 121,
185-199 (1989). Genes localized by linkage can be cloned
by a process known as directional cloning. See Wainwright, 
Med. J. Australia 159, 170-174 (1993); Collins, Nature 
Genetics 1, 3-6 (1992) (each of which is incorporated by 
reference in its entirety for all purposes). 

Linkage studies are typically performed on members of a 
family. Available members of the family are characterized 
for the presence or absence of a phenotypic trait and for a set 
of polymorphic markers. The distribution of polymorphic 
markers in an informative meiosis is then analyzed to 
determine which polymorphic markers co-segregate with a 
phenotypic trait. See, e.g., Kerem et al., Science 245,
1073-1080 (1989); Monaco et al., Nature 316, 842 (1985);
Yamoka et al., Neurology 40, 222-226 (1990); Rossiter et
al., FASEB Journal 5, 21-27 (1991).

Linkage is analyzed by calculation of LOD (log of the
odds) values. A lod value is the relative likelihood of
obtaining observed segregation data for a marker and a
genetic locus when the two are located at a recombination
fraction θ, versus the situation in which the two are not
linked, and thus segregating independently (Thompson &
Thompson, Genetics in Medicine (5th ed., W.B. Saunders
Company, Philadelphia, 1991); Strachan, "Mapping the
human genome" in The Human Genome (BIOS Scientific
Publishers Ltd, Oxford), Chapter 4). A series of likelihood
ratios is calculated at various recombination fractions (θ),
ranging from θ=0.0 (coincident loci) to θ=0.50 (unlinked).
Thus, the likelihood ratio at a given value of θ is the
probability of the data if the loci are linked at θ divided by
the probability of the data if the loci are unlinked.
The computed likelihoods are usually expressed as the log10
of this ratio (i.e., a lod score). For example, a lod score of
3 indicates 1000:1 odds against an apparent observed
linkage being a coincidence. The use of logarithms allows data
collected from different families to be combined by simple
addition. Computer programs are available for the
calculation of lod scores for differing values of θ (e.g., LIPED,
MLINK (Lathrop, Proc. Nat. Acad. Sci. (USA) 81,
3443-3446 (1984))). For any particular lod score, a
recombination fraction may be determined from mathematical
tables. See Smith et al., Mathematical tables for research
workers in human genetics (Churchill, London, 1961);
Smith, Ann. Hum. Genet. 32, 127-150 (1968). The value of
θ at which the lod score is the highest is considered to be the
best estimate of the recombination fraction.
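
The lod calculation described above can be sketched for the simplest case. The following is a minimal illustration only, assuming phase-known, fully informative meioses in which each meiosis is scored as recombinant or non-recombinant; the function names and the grid search are assumptions for illustration and do not reproduce LIPED or MLINK.

```python
import math


def lod_score(recombinants, meioses, theta):
    """log10 of L(data | linked at theta) / L(data | unlinked, theta = 0.5).

    Assumes phase-known, fully informative meioses (a simplifying
    assumption; real pedigree analysis is more involved).
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    k, n = recombinants, meioses
    if theta == 0.0:
        # Any observed recombinant is impossible at theta = 0.
        return float("-inf") if k > 0 else n * math.log10(2.0)
    log_linked = k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)
    log_unlinked = n * math.log10(0.5)
    return log_linked - log_unlinked


def best_theta(recombinants, meioses, step=0.01):
    """Grid search for the theta in [0, 0.5] that maximizes the lod score."""
    thetas = [i * step for i in range(int(round(0.5 / step)) + 1)]
    return max(thetas, key=lambda t: lod_score(recombinants, meioses, t))


# Hypothetical family data: 2 recombinants observed in 20 meioses.
theta_hat = best_theta(2, 20)
peak_lod = lod_score(2, 20, theta_hat)
```

Under these assumptions the maximizing θ is simply the observed recombinant fraction; the grid search is shown only because it mirrors the text's description of evaluating the ratio at a series of θ values.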

Positive lod score values suggest that the two loci are
linked, whereas negative values suggest that linkage is less
likely (at that value of θ) than the possibility that the two loci
are unlinked. By convention, a combined lod score of +3 or
greater (equivalent to greater than 1000:1 odds in favor of
linkage) is considered definitive evidence that two loci are
linked. Similarly, by convention, a negative lod score of -2
or less is taken as definitive evidence against linkage of the
two loci being compared. Negative linkage data are useful in
excluding a chromosome or a segment thereof from
consideration. The search focuses on the remaining non-excluded
chromosomal locations.
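
The conventions above (addition of lod scores across families, +3 for linkage, -2 for exclusion) can be expressed as a short sketch. The function names and the per-family values are hypothetical; the sketch assumes all scores were computed at the same value of θ.

```python
def combine_lod_scores(family_lods):
    """Combine per-family lod scores (same theta) by simple addition."""
    return sum(family_lods)


def interpret_lod(total_lod):
    """Apply the conventional thresholds described in the text."""
    if total_lod >= 3.0:
        return "linked"
    if total_lod <= -2.0:
        return "excluded"
    return "inconclusive"


# Hypothetical per-family lod scores at one value of theta.
verdict = interpret_lod(combine_lod_scores([1.2, 0.9, 1.1]))
```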

IV. Modified Polypeptides and Gene Sequences

The invention further provides variant forms of nucleic
acids and corresponding proteins. The nucleic acids comprise
at least ten contiguous bases of one of the sequences
described in TABLE 2 (SEQ ID NOS:1 and 2), TABLE 3
(SEQ ID NOS:3 and 4), TABLE 4 (SEQ ID NOS:5 and 6),
TABLE 5 (SEQ ID NOS:7-15), TABLE 6 (SEQ ID NOS:16
and 17), TABLE 7 (SEQ ID NOS:18 and 19), TABLE 8
(SEQ ID NOS:20 and 21), TABLE 9 (SEQ ID NOS:22-24),
TABLE 10 (SEQ ID NOS:25-27) and TABLE 11 (SEQ ID
NOS:28 and 29), designated M1-M10. Some nucleic acids
encode full-length variant forms of proteins. Similarly,
variant proteins have the prototypical amino acid sequences
encoded by the nucleic acid sequences shown in Tables 2-11,
designated M1-M10 (read so as to be in-frame with the
full-length coding sequence of which it is a component).
Variant genes can be expressed in an expression vector in
which a variant gene is operably linked to a native or other
promoter. Usually, the promoter is a eukaryotic promoter for
expression in a mammalian cell. The transcription regulation
sequences typically include a heterologous promoter and
optionally an enhancer which is recognized by the host. The
selection of an appropriate promoter, for example trp, lac,
phage promoters, glycolytic enzyme promoters and tRNA
promoters, depends on the host selected. Commercially
available expression vectors can be used. Vectors can
include host-recognized replication systems, amplifiable
genes, selectable markers, host sequences useful for
insertion into the host genome, and the like.

The means of introducing the expression construct into a
host cell varies depending upon the particular construction
and the target host. Suitable means include fusion,
conjugation, transfection, transduction, electroporation or
injection, as described in Sambrook, supra. A wide variety of
host cells can be employed for expression of the variant
gene, both prokaryotic and eukaryotic. Suitable host cells
include bacteria such as E. coli, yeast, filamentous fungi,
insect cells, mammalian cells, typically immortalized, e.g.,
mouse, CHO, human and monkey cell lines and derivatives
thereof. Preferred host cells are able to process the variant
gene product to produce an appropriate mature polypeptide.
Processing includes glycosylation, ubiquitination, disulfide
bond formation, general post-translational modification, and
the like.

The protein may be isolated by conventional means of
protein biochemistry and purification to obtain a
substantially pure product, i.e., 80, 95 or 99% free of cell component
contaminants, as described in Jakoby, Methods in Enzymology
Volume 104, Academic Press, New York (1984);
Scopes, Protein Purification, Principles and Practice, 2nd
Edition, Springer-Verlag, New York (1987); and Deutscher
(ed.), Guide to Protein Purification, Methods in Enzymology,
Vol. 182 (1990). If the protein is secreted, it can be isolated
from the supernatant in which the host cell is grown. If not
secreted, the protein can be isolated from a lysate of the host
cells.

The invention further provides transgenic nonhuman
animals capable of expressing an exogenous variant gene
and/or having one or both alleles of an endogenous variant
gene inactivated. Expression of an exogenous variant gene
is usually achieved by operably linking the gene to a
promoter and optionally an enhancer, and microinjecting the
construct into a zygote. See Hogan et al., "Manipulating the
Mouse Embryo, A Laboratory Manual," Cold Spring Harbor
Laboratory. Inactivation of endogenous variant genes can be
achieved by forming a transgene in which a cloned variant
gene is inactivated by insertion of a positive selection
marker. See Capecchi, Science 244, 1288-1292 (1989). The
transgene is then introduced into an embryonic stem cell,
where it undergoes homologous recombination with an
endogenous variant gene. Mice and other rodents are
preferred animals. Such animals provide useful drug screening
systems.

In addition to substantially full-length polypeptides
expressed by variant genes, the present invention includes
biologically active fragments of the polypeptides, or analogs
thereof, including organic molecules which simulate the
interactions of the peptides. Biologically active fragments
include any portion of the full-length polypeptide which
confers a biological function on the variant gene product,
including ligand binding, and antibody binding. Ligand
binding includes binding by nucleic acids, proteins or
polypeptides, small biologically active molecules, or large
cellular structures.

Polyclonal and/or monoclonal antibodies that specifically
bind to variant gene products but not to corresponding
prototypical gene products are also provided. Antibodies can
be made by injecting mice or other animals with the variant
gene product or synthetic peptide fragments thereof.
Monoclonal antibodies are screened as are described, for example,
in Harlow & Lane, Antibodies, A Laboratory Manual, Cold
Spring Harbor Press, New York (1988); Goding,
Monoclonal antibodies, Principles and Practice (2d ed.)
Academic Press, New York (1986). Monoclonal antibodies are
tested for specific immunoreactivity with a variant gene
product and lack of immunoreactivity to the corresponding
prototypical gene product. These antibodies are useful in
diagnostic assays for detection of the variant form, or as an
active ingredient in a pharmaceutical composition.

V. Kits

The invention further provides kits comprising at least
one allele-specific oligonucleotide as described above.
Often, the kits contain one or more pairs of allele-specific
oligonucleotides hybridizing to different forms of a
polymorphism. In some kits, the allele-specific oligonucleotides
are provided immobilized to a substrate. For example, the
same substrate can comprise allele-specific oligonucleotide
probes for detecting at least 10, 100 or all of the
polymorphisms shown in Tables 2-11. Optional additional
components of the kit include, for example, restriction enzymes,
reverse-transcriptase or polymerase, the substrate
nucleoside triphosphates, means used to label (for example, an
avidin-enzyme conjugate and enzyme substrate and
chromogen if the label is biotin), and the appropriate buffers for
reverse transcription, PCR, or hybridization reactions.
Usually, the kit also contains instructions for carrying out the
methods.

EXAMPLES

The polymorphisms set forth in this application were
identified by hybridization to tiling arrays. Tiling arrays are
described in PCT/US94/12305 (incorporated by reference in
its entirety for all purposes). Tiling generally means the
synthesis of a defined set of oligonucleotide probes that is
made up of a sequence complementary to the sequence to be
analyzed (the "target sequence"), as well as preselected
variations of that sequence. The variations usually include
substitution at one or more base positions with one or more
nucleotides. Tiling strategies are discussed in WO 95/11995
(incorporated by reference in its entirety for all purposes).
With a tiled array containing 4L probes one can query every
position in a nucleotide containing L number of bases. A 4L
tiled array, for example, contains L number of sets of 4
probes, i.e., 4L probes. Each set of 4 probes contains the
perfect complement to a portion of the target sequence with
a single substitution for each nucleotide at the same position
in the probe. See also Chee, M., et al., Science, October,
1996.

To identify the polymorphic sites
provided in this application, we designed a P^{25,13} (25-mer
probes having the interrogation position at base 13) 4L tiling
array for the G6PD locus. Because the G6PD locus contains
a large number of Alu sequences (repeat sequences), we
simplified the tiled probe array by not probing the repetitive
Alu sequences. To generate target sequence fragments,
[...] individuals. Long range PCR
amplification was carried out on genomic DNA. The
amplicons were labeled, fragmented, and used to determine
hybridization to the array.

All publications and patent applications cited above are
incorporated by reference in their entirety for all purposes to
the same extent as if each individual publication or patent
application were specifically and individually indicated to be
so incorporated by reference. Although the present invention
has been described in some detail by way of illustration and
example for purposes of clarity and understanding, it will be
apparent that certain changes and modifications may be
practiced within the scope of the appended claims.

TABLE 1

TABLE 2

                    Sequence               SEQ ID NO:

starting sequence   TGAGCAACAGTGGAAATTTG   1
M1                  TGAGCAACAGTGGAAATTTG   1
M10                 TGAGCAACAGTGGAAATTTG   1
M2                  TGAGCAACAATGGAAATTTG   2
M3                  TGAGCAACAATGGAAATTTG   2
M4                  TGAGCAACAATGGAAATTTG   2
M5                  TGAGCAACAGTGGAAATTTG   1
M6                  TGAGCAACAATGGAAATTTG   2
M7                  TGAGCAACAGTGGAAATTTG   1
M8                  TGAGCAACAGTGGAAATTTG   1
M9                  TGAGCAACAATGGAAATTTG   2
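
As a rough illustration of the 4L tiling strategy described in the Examples, the sketch below builds, for each interrogated position of a target, a set of four probes that are complementary to the surrounding window and differ only at the interrogation position. It is an assumption-laden toy, not the array design actually used: the function names are hypothetical, only positions with a full flanking window are tiled, and the example uses the 20-base Table 2 starting sequence with short 9-mer probes (interrogation position 5) because a 25-mer window does not fit in a 20-base target.

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def tile_4l(target, probe_len=25, interrogation_pos=13):
    """Build sets of 4 probes per interrogated target position.

    Returns a dict mapping the 1-based target position to 4 probes.
    Each probe is the reverse complement of the target window with
    A, C, G or T placed at the interrogated position, so exactly one
    probe per set is the perfect complement of the target.
    """
    if not 1 <= interrogation_pos <= probe_len:
        raise ValueError("interrogation_pos must lie within the probe")
    left = interrogation_pos - 1           # bases before the interrogated base
    right = probe_len - interrogation_pos  # bases after the interrogated base
    probe_sets = {}
    for i in range(left, len(target) - right):
        window = target[i - left:i + right + 1]
        probes = []
        for base in BASES:
            variant = window[:left] + base + window[left + 1:]
            probes.append(reverse_complement(variant))
        probe_sets[i + 1] = probes
    return probe_sets


# Toy example: Table 2 starting sequence, 9-mer probes, interrogation at base 5.
target = "TGAGCAACAGTGGAAATTTG"
probe_sets = tile_4l(target, probe_len=9, interrogation_pos=5)
```

In this toy layout, the probe set for target position 10 contains both the perfect complement of the starting sequence and the probe matching the A-for-G variant seen in samples M2-M4, M6 and M9; in a hybridization experiment the brightest probe in the set would indicate which base is present.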



TABLE 3

                    Sequence               SEQ ID NO:

starting sequence   GCAGTTTGAGTGTCTCTGGT   3
M1                  GCAGTTTGAGTGTCTCTGGT   3
M10                 GCAGTTTGAGTGTCTCTGGT   3
M2                  GCAGTTTGAATGTCTCTGGT   4
M3                  GCAGTTTGAATGTCTCTGGT   4
M4                  GCAGTTTGAATGTCTCTGGT   4
M5                  GCAGTTTGAGTGTCTCTGGT   3
M6                  GCAGTTTGAATGTCTCTGGT   4
M7                  GCAGTTTGAGTGTCTCTGGT   3
M8                  GCAGTTTGAGTGTCTCTGGT   3
M9                  GCAGTTTGAATGTCTCTGGT   4
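
The polymorphic position in a table such as Table 3 can be located by comparing a sample sequence with the starting sequence base by base. The helper below is a hypothetical sketch, not part of the specification; it assumes equal-length, pre-aligned sequences and skips positions where either sequence carries an ambiguous N call.

```python
def polymorphic_positions(reference, sample):
    """Return 1-based positions where sample differs from reference.

    Assumes the two sequences are already aligned and of equal length.
    Positions with an ambiguous base call (N) in either sequence are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        index + 1
        for index, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base and "N" not in (ref_base, sample_base)
    ]


# Table 3: starting sequence versus the sequence found in sample M2.
starting = "GCAGTTTGAGTGTCTCTGGT"
m2 = "GCAGTTTGAATGTCTCTGGT"
positions = polymorphic_positions(starting, m2)
```

For the Table 3 pair shown, the single difference falls at position 10 (G in the starting sequence, A in M2), which matches the position recited for the polymorphic site in the claims.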



TABLE 4

                    Sequence               SEQ ID NO:

starting sequence   GTAAATGCTCTGCAAATAAC   5
M1                  GTAAATGCTCTGCAAATAAC   5
M10                 GTAAATGCTCTGCAAATAAC   5
M2                  GTAAATGCTCTGCAAATAAC   5
M3                  GTAAATGCTCTGCAAATAAC   5
M4                  GTAAATGCTCTGCAAATAAC   5
M5                  GTNANTNCTGNGCAANTANC   6
M6                  GTAAATGCTCTGCAAATAAC   5
M7                  GTAAATGCTCTGCAAATAAC   5
M8                  GTAAATGCTCTGCAAATAAC   5
M9                  GTAAATGCTCTGCAAATAAC   5



TABLE 5

SEQ ID NO:

starting sequence

TABLE 6

                    Sequence               SEQ ID NO:

starting sequence   GACCTCTTTAGCTCGTTATT   16
M1                  GACCTCTTTTGCTCGTTATT   17
M10                 GACCTCTTTAGCTCGTTATT   16
M2                  GACCTCTTTAGCTCGTTATT   16
M3                  GACCTCTTTAGCTCGTTATT   16
M4                  GACCTCTTTAGCTCGTTATT   16
M5                  GACCTCTTTAGCTCGTTATT   16
M6                  GACCTCTTTAGCTCGTTATT   16
M7                  GACCTCTTTAGCTCGTTATT   16
M8                  GACCTCTTTAGCTCGTTATT   16
M9                  GACCTCTTTAGCTCGTTATT   16



TABLE 7

                    Sequence               SEQ ID NO:

starting sequence   GGGCCTCAAGATTTGATTTC   18
M1                  GGGCCTCAAGATTTGATTTC   18
M10                 GGGCCTCAAGATTTGATTTC   18
M2                  GGGCCTCANTTTTTGATTTC   19
M3                  GGGCCTCANTTTTTGATTTC   19
M4                  GGGCCTCAAGATTTGATTTC   18
M5                  GGGCCTCAAGATTTGATTTC   18
M6                  GGGCCTCAAGATTTGATTTC   18
M7                  GGGCCTCAAGATTTGATTTC   18
M8                  GGGCCTCAAGATTTGATTTC   18
M9                  GGGCCTCAAGATTTGATTTC   18



TABLE 8

                    Sequence               SEQ ID NO:

starting sequence   AGGGGGGCTTTTTCCAGCTC   20
M1                  AGGGGGGCTCTTTCCAGCTC   21
M10                 AGGGGGGCTTTTTCCAGCTC   20
M2                  AGGGGGGCTCTTTCCAGCTC   21
M3                  AGGGGGGCTCTTTCCAGCTC   21
M4                  AGGGGGGCTCTTTCCAGCTC   21
M5                  AGGGGGGCTCTTTCCAGCTC   21
M6                  AGGGGGGCTCTTTCCAGCTC   21
M7                  AGGGGGGCTCTTTCCAGCTC   21
M8                  AGGGGGGCTCTTTCCAGCTC   21
M9                  AGGGGGGCTCTTTCCAGCTC   21



TABLE 9

                    Sequence               SEQ ID NO:

starting sequence   GCCTCCTTCGTTCTACGACA   22
M1                  GCCTCCTTCGTTCTACGACA   22
M10                 GCCTCCTTCGTTCTACGACA   22
M2                  GCCTCCTTCGTTCTACGACA   22
M3                  GCCTCCTTCGTTCTACGACA   22
M4                  GCCTCCTTCGTTCTACGACA   22
M5                  GCCTCCTTCGTTCTACGACA   22
M6                  GCCTCCTTNATTCTACGACA   23
M7                  GCCTCCTTCGTTCTACGACA   22
M8                  GCCTCCTTCGTTCTACGACA   22
M9                  GCCTCCTTCATTCTACGACA   24

TABLE 10

SEQ ID NO:

starting sequence

TABLE 11

SEQ ID NO:

starting sequence

SEQUENCE LISTING

(1) GENERAL INFORMATION:

(iii) NUMBER OF SEQUENCES: 29



(2) INFORMATION FOR SEQ ID NO:1:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:1:

TGAGCAACAG TGGAAATTTG 20

(2) INFORMATION FOR SEQ ID NO:2:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:2:

TGAGCAACAA TGGAAATTTG 20



(2) INFORMATION FOR SEQ ID NO:3:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:3:

GCAGTTTGAG TGTCTCTGGT 20

(2) INFORMATION FOR SEQ ID NO:4:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:4:

GCAGTTTGAA TGTCTCTGGT 20



(2) INFORMATION FOR SEQ ID NO:5:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:5:

GTAAATGCTC TGCAAATAAC 20

(2) INFORMATION FOR SEQ ID NO:6:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:6:

GTNANTNCTG NGCAANTANC 20



(2) INFORMATION FOR SEQ ID NO:7:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:7:

(2) INFORMATION FOR SEQ ID NO:8:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:8:

GGCTCCAAGC GGTGCNCGNC 20



(2) INFORMATION FOR SEQ ID NO:9:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:9:

GGCTCCAAGC CGTGNNCGNC 20

(2) INFORMATION FOR SEQ ID NO:10:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:10:

GGCTCCAAGC GGTGCNNGNC 20



(2) INFORMATION FOR SEQ ID NO:11:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:11:

GGCTCCAAGC GGTGCCCGNC 20

(2) INFORMATION FOR SEQ ID NO:12:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:12:

GGCTCCAAGC GGTGCNNGGN 20



(2) INFORMATION FOR SEQ ID NO:13:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:13:

GGCTCNNNNT NGNNNNNNNC 20

(2) INFORMATION FOR SEQ ID NO:14:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:14:

GGCTCCAAGC GGTGCNCGCC 20



(2) INFORMATION FOR SEQ ID NO:15:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:15:

GGCTCCNAAT GGTNNNNGNC 20

(2) INFORMATION FOR SEQ ID NO:16:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:16:

(2) INFORMATION FOR SEQ ID NO:17:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:17:



(2) INFORMATION FOR SEQ ID NO:18:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:18:

GGGCCTCAAG ATTTGATTTC 20

(2) INFORMATION FOR SEQ ID NO:19:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:19:

GGGCCTCANT ATTTGATTTC 20

(2) INFORMATION FOR SEQ ID NO:20:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:20:

AGGGGGGCTT TTTCCAGCTC 20

(2) INFORMATION FOR SEQ ID NO:21:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:21:

AGGGGGGCTC TTTCCAGCTC 20



(2) INFORMATION FOR SEQ ID NO:22:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:22:

GCCTCCTTCG TTCTACGACA 20

(2) INFORMATION FOR SEQ ID NO:23:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:23:

GCCTCCTTNA TTCTACGACA 20



(2) INFORMATION FOR SEQ ID NO:24:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:24:

GCCTCCTTCA TTCTACGACA 20

(2) INFORMATION FOR SEQ ID NO:25:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:25:

AGGGTCCCCG TCCTCACCTC 20



(2) INFORMATION FOR SEQ ID NO:26:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:26:

AGNGTNCGNA TCCTCACCTG 20

(2) INFORMATION FOR SEQ ID NO:27:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:27:

NGGGTGCGCG TCCTCANCTG 20



(2) INFORMATION FOR SEQ ID NO:28:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:28:

AACCAGAATT TATTTTGAGG 20

(2) INFORMATION FOR SEQ ID NO:29:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:29:

AACCAGAATG TATTTTGAGG 20

What is claimed is:

1. An isolated nucleic acid segment of between 10 and
100 bases of which at least 10 contiguous bases including a
polymorphic site are from a sequence selected from the
group consisting of SEQ ID NOS:1-5, SEQ ID No:5 in
which the C at position 10 is replaced by a G, SEQ ID No:7,
SEQ ID No:7 in which the C at position 10 is replaced by
a T, SEQ ID Nos:16-18, SEQ ID No:18 in which the G at
position 10 is replaced by a T, SEQ ID Nos:20-22, SEQ ID
Nos:24 and 25, SEQ ID No:25 in which the G at position 10
is replaced by an A, SEQ ID Nos:28 and 29, and the perfect
complements thereof, wherein the polymorphic site occurs
at position 10 in each of the SEQ ID Nos.

2. The isolated nucleic acid segment of claim 1 that is 
DNA. 

3. The isolated nucleic acid segment of claim 1 that is 
RNA. 

4. The isolated nucleic acid segment of claim 1 that is less 
than 50 bases. 

5. The isolated nucleic acid segment of claim 1 that is less
than 20 bases.

6. The isolated nucleic acid segment of claim 1, wherein
the ten contiguous bases are from a sequence selected from
the group consisting of SEQ ID Nos:2 and 4, SEQ ID No:5
in which the C at position 10 is replaced by a G, SEQ ID
No:7 in which the C at position 10 is replaced by a T, SEQ
ID No:17, SEQ ID No:18 in which the G at position 10 is
replaced by a T, SEQ ID Nos:21 and 24, SEQ ID NO:25 in
which the G at position 10 is replaced by an A, SEQ ID
No:29, and the perfect complements thereof, wherein the
polymorphic site occurs at position 10 in each of the SEQ ID
NOS.

7. The isolated nucleic acid segment of claim 1, which is
a probe, and wherein the polymorphic site occupies a central
position of the probe.

8. The nucleic acid of claim 1, which is a primer, and
wherein the polymorphic site occupies the 3' end of the
primer.

9. An isolated nucleic acid fragment of a human X
chromosome comprising at least 10 contiguous bases
including a polymorphic site from a sequence selected from the
group consisting of SEQ ID Nos:2 and 4, SEQ ID No:5 in
which the C at position 10 is replaced by a G, SEQ ID No:7
in which the C at position 10 is replaced by a T, SEQ ID
No:17, SEQ ID No:18 in which the G at position 10 is
replaced by a T, SEQ ID NOS:21 and 24, SEQ ID NO:25 in
which the G at position 10 is replaced by an A, SEQ ID
NO:29, and the perfect complements thereof, wherein the
polymorphic site occurs at position 10 in each of the SEQ ID
NOS.

10. The isolated nucleic acid fragment of a human
X-chromosome of claim 9, comprising a sequence selected
from the group consisting of SEQ ID NOS:2 and 4, SEQ ID
NO:5 in which the C at position 10 is replaced by a G, SEQ
ID NO:7 in which the C at position 10 is replaced by a T,
SEQ ID NO:17, SEQ ID NO:18 in which the G at position
10 is replaced by a T, SEQ ID NOS:21 and 24, SEQ ID
NO:25 in which the G at position 10 is replaced by an A, SEQ
ID NO:29, and the perfect complements thereof, wherein
the polymorphic site occurs at position 10 in each of the
SEQ ID NOS.

11. A method of determining a base occupying a
polymorphic site in a nucleic acid, comprising:

obtaining the nucleic acid from an individual; and
determining a base occupying a polymorphic site in a
sequence selected from the group consisting of SEQ ID
NOS:2 and 4, SEQ ID NO:5 in which the C at position
10 is replaced by a G, SEQ ID NO:7 in which the C at
position 10 is replaced by a T, SEQ ID NO:17, SEQ ID
NO:18 in which the G at position 10 is replaced by a T,
SEQ ID NOS:21 and 24, SEQ ID NO:25 in which the
G at position 10 is replaced by an A, SEQ ID NO:29,
and the perfect complements thereof, wherein the
polymorphic site occurs at position 10 in each of the SEQ
ID NOS.

12. The method of claim 11, wherein the determining
comprises determining a set of bases occupying a set of
polymorphic sites in a set of sequences selected from the
group consisting of SEQ ID NOS:2 and 4, SEQ ID NO:5 in
which the C at position 10 is replaced by a G, SEQ ID NO:7
in which the C at position 10 is replaced by a T, SEQ ID
NO:17, SEQ ID NO:18 in which the G at position 10 is replaced
by a T, SEQ ID NOS:21 and 24, SEQ ID NO:25 in which
the G at position 10 is replaced by an A, SEQ ID NO:29, and
the perfect complements thereof, wherein the polymorphic
site occurs at position 10 in each of the SEQ ID NOS.

13. The method of claim 12, wherein the nucleic acid is 
obtained from a plurality of individuals, and a base occu- 
pying one of the polymorphic sites is determined in each of 
the individuals, and the method further comprising testing 
each individual for the presence of a disease phenotype, and 
correlating the presence of the disease phenotype with the 
base. 

***** 
Description

COPYRIGHT NOTICE 

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The
copyright owner has no objection to the xerographic reproduction by anyone of the patent document or the patent
disclosure in exactly the form it appears in the Patent and Trademark Office patent file or records, but otherwise reserves
all copyright rights whatsoever.

GOVERNMENT RIGHTS NOTICE

Portions of the material in this specification arose in the course of or under contract nos. 92ER81275 (SBIR) between
Affymetrix, Inc. and the Department of Energy and/or H600813-1, -2 between Affymetrix, Inc. and the National Institutes
of Health.


BACKGROUND OF THE INVENTION 

The present invention relates to the field of computer systems. More specifically the present invention relates to 
computer systems for visualizing biological sequences, as well as for evaluating and comparing biological sequences. 

Devices and computer systems for forming and using arrays of materials on a substrate are known. For example,
PCT applications WO92/10588 and 95/11995, incorporated herein by reference for all purposes, describe techniques
for sequencing or sequence checking nucleic acids and other materials. Arrays for performing these operations may be
formed according to the methods of, for example, the pioneering techniques disclosed in U.S. Patent Nos.
5,445,934 and 5,384,261, and U.S. Patent Application No. 08/249,188, each incorporated herein by reference for all
purposes.

According to one aspect of the techniques described therein, an array of nucleic acid probes is fabricated at known
locations on a chip or substrate. A labeled nucleic acid is then brought into contact with the chip and a scanner generates
an image file (also called a cell file) indicating the locations where the labeled nucleic acids bound to the chip. Based
upon the image file and identities of the probes at specific locations, it becomes possible to extract information such as
the monomer sequence of DNA or RNA. Such systems have been used to form, for example, arrays of DNA that may
be used to study and detect mutations relevant to cystic fibrosis, the P53 gene (relevant to certain cancers), HIV, and
other genetic characteristics.

Improved computer systems and methods are needed to evaluate, analyze, and process the vast amount of information
now used and made available by these pioneering technologies.

SUMMARY OF THE INVENTION 

An improved computer-aided system for visualizing and determining the sequence of nucleic acids is disclosed.
The computer system provides, among other things, improved methods of analyzing fluorescent image files of a chip
containing hybridized nucleic acid probes in order to call bases in sample nucleic acid sequences.

According to one aspect of the invention, a computer system is used to identify an unknown base in a sample nucleic
acid sequence by the steps of:

- inputting multiple probe intensities, each of the probe intensities being associated with a nucleic acid probe;
- the computer system comparing the multiple probe intensities where each of the probe intensities is substantially
proportional to a nucleic acid probe hybridizing with at least one nucleic acid sequence; and
- calling the unknown base according to the results of the comparison of the multiple probe intensities.

According to one specific aspect of the invention, a higher probe intensity is compared to a lower probe intensity to
call the unknown base. According to another specific aspect of the invention, probe intensities of a sample sequence
are compared to probe intensities of a reference sequence. According to yet another specific aspect of the invention,
probe intensities of a sample sequence are compared to statistics about probe intensities of a reference sequence from
multiple experiments.

According to another aspect of the invention, a method is disclosed of processing reference and sample nucleic
acid sequences to reduce the variations between the experiments by the steps of:

- providing a plurality of nucleic acid probes;
- labeling the reference nucleic acid sequence with a first marker;
- labeling the sample nucleic acid sequence with a second marker; and
- hybridizing the labeled reference and sample nucleic acid sequences at the same time.

According to another aspect of the invention, a computer system is used to identify mutations in a sample nucleic
acid sequence by the steps of:

- inputting a first set of probe intensities, each of the probe intensities in said first set being associated with a nucleic
acid probe and substantially proportional to the associated nucleic acid probe hybridizing with a reference nucleic
acid sequence;
- inputting a second set of probe intensities, each of the probe intensities in said second set being associated with a
nucleic acid probe and substantially proportional to the associated nucleic acid probe hybridizing with said sample
sequence;
- the computer system comparing probe intensities in the first set to probe intensities in the second set to select
hybridization regions where the probe intensities in the first and second sets differ; and
- identifying mutations according to characteristics of the selected regions.
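
The comparison and region-selection steps listed above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function name `differing_regions`, the relative-difference test, and the default `threshold` are hypothetical choices made for the example and are not specified in this summary.

```python
from typing import List, Tuple


def differing_regions(reference: List[float],
                      sample: List[float],
                      threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end inclusive, for contiguous runs of
    probe positions where the sample intensity differs from the reference
    intensity by more than `threshold` times the reference intensity.

    The relative-difference criterion and the default threshold are
    assumptions; the text only states that the two sets "differ".
    """
    if len(reference) != len(sample):
        raise ValueError("intensity sets must have the same length")

    regions: List[Tuple[int, int]] = []
    start = None  # index where the current differing run began, if any
    for i, (ref, smp) in enumerate(zip(reference, sample)):
        differs = abs(ref - smp) > threshold * abs(ref)
        if differs and start is None:
            start = i
        elif not differs and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(reference) - 1))
    return regions


# Made-up intensities: the sample is much dimmer at positions 2 to 4.
reference = [100.0, 98.0, 102.0, 99.0, 101.0, 100.0]
sample = [97.0, 95.0, 40.0, 20.0, 45.0, 99.0]
print(differing_regions(reference, sample))  # [(2, 4)]
```

Each selected region would then be examined for the characteristics that indicate a mutation; that step is left out here because it depends on details given later in the specification.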

According to yet another aspect of the invention, a computer system is used for comparative analysis and visualization
of multiple sequences by the steps of:

- displaying at least one reference sequence in a first area on a display device; and
- displaying at least one sample sequence in a second area on said display device;

whereby a user is capable of visually comparing the multiple sequences.

A further understanding of the nature and advantages of the inventions herein may be realized by reference to the 
remaining portions of the specification and the attached drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 illustrates an example of a computer system used to execute the software of the present invention;

Fig. 2 shows a system block diagram of a typical computer system used to execute the software of the present
invention;

Fig. 3 illustrates an overall system for forming and analyzing arrays of biological materials such as DNA or RNA;

Fig. 4 is an illustration of the software for the overall system;

Fig. 5 illustrates the global layout of a chip formed in the overall system;

Fig. 6 illustrates conceptually the binding of probes on chips;

Fig. 7 illustrates probes arranged in lanes on a chip;

Fig. 8 illustrates a hybridization pattern of a target on a chip with a reference sequence as in Fig. 7;

Fig. 9 illustrates the high level flow of the intensity ratio method;

Fig. 10A illustrates the high level flow of one implementation of the reference method and Fig. 10B shows an analysis
table for use with the reference method;

Fig. 11A illustrates the high level flow of another implementation of the reference method; Fig. 11B shows a data
table for use with the reference method; Fig. 11C shows a graph of the normalized sample base intensities minus
the normalized reference base intensities; and Fig. 11D shows other graphs of data in the data table;

Fig. 12 illustrates the high level flow of the statistical method;

Fig. 13 illustrates the pooling processing of a reference and sample nucleic acid sequence;

Figs. 14A and 14C show graphs of scaled fluorescent intensities of wild-type probes hybridizing with sample and
reference sequences and Fig. 14B shows a hypothetical graph of fluorescent intensities of wild-type probes hybridizing
with two sample sequences and a reference sequence;

Fig. 15 illustrates the high level flow of an embodiment that uses the hybridization data from more than one base position
to identify mutations in a sample sequence;

Fig. 16 illustrates the main screen and the associated pull down menus for comparative analysis and visualization
of multiple experiments;

Fig. 17 illustrates an intensity graph window for a selected base;

Fig. 18 illustrates multiple intensity graph windows for selected bases;

Fig. 19 illustrates the intensity ratio method correctly calling a mutation in solutions with varying concentrations;

Fig. 20 illustrates the reference method correctly calling a mutant base where the intensity ratio method incorrectly
called the mutant base; and

Fig. 21 illustrates the output of the ViewSeq™ program with four pretreatment samples and four posttreatment samples.

DESCRIPTION OF THE PREFERRED EMBODIMENT

CONTENTS

I. General
II. Intensity Ratio Method
III. Reference Method
IV. Statistical Method
V. Pooling Processing
VI. Comparative Analysis
VII. Examples

I. General 

In the description that follows, the present invention will be described in reference to a Sun Workstation in a UNIX
environment. The present invention, however, is not limited to any particular hardware or operating system environment.
Instead, those skilled in the art will find that the systems and methods of the present invention may be advantageously
applied to a variety of systems, including IBM personal computers running MS-DOS or Microsoft Windows. Therefore,
the following description of specific systems is for purposes of illustration and not limitation.

Fig. 1 illustrates an example of a computer system used to execute the software of the present invention. Fig. 1
shows a computer system 1 which includes a monitor 3, screen 5, cabinet 7, keyboard 9, and mouse 11. Mouse 11 may
have one or more buttons such as mouse buttons 13. Cabinet 7 houses a floppy disk drive 14 and a hard drive (not
shown) that may be utilized to store and retrieve software programs incorporating the present invention. Although a
floppy disk 15 is shown as the removable media, other removable tangible media including CD-ROM, flash memory and
tape may be utilized. Cabinet 7 also houses familiar computer components (not shown) such as a processor, memory
and the like.

Fig. 2 shows a system block diagram of computer system 1 used to execute the software of the present invention.
As in Fig. 1, computer system 1 includes monitor 3 and keyboard 9. Computer system 1 further includes subsystems
such as a central processor 52, system memory 54, I/O controller 56, display adapter 58, serial port 62, disk 64, network
interface 66, and speaker 68. Disk 64 is representative of an internal hard drive, floppy drive, CD-ROM, flash memory,
tape, or any other storage medium. Other computer systems suitable for use with the present invention may include
additional or fewer subsystems. For example, another computer system could include more than one processor 52 (i.e.,
a multi-processor system) or memory cache.

Arrows such as 70 represent the system bus architecture of computer system 1. However, these arrows are illustrative
of any interconnection scheme serving to link the subsystems. For example, speaker 68 could be connected to
the other subsystems through a port or have an internal direct connection to central processor 52. Computer system 1
shown in Fig. 2 is but an example of a computer system suitable for use with the present invention. Other configurations
of subsystems suitable for use with the present invention will be readily apparent to one of ordinary skill in the art.

The VLSIPS™ technology provides methods of making very large arrays of oligonucleotide probes on very small
chips. See U.S. Patent No. 5,143,854 and PCT patent publication Nos. WO 90/15070 and 92/10092, each of which is
incorporated by reference for all purposes. The oligonucleotide probes on the DNA probe array are used to detect
complementary nucleic acid sequences in a sample nucleic acid of interest (the "target" nucleic acid).

The present invention provides methods of analyzing hybridization intensity files for a chip containing hybridized
nucleic acid probes. In a representative embodiment, the files represent fluorescence data from a biological array, but
the files may also represent other data such as radioactive intensity data or large molecule detection data. Therefore,
the present invention is not limited to analyzing fluorescent measurements of hybridizations but may be readily utilized
to analyze other measurements of hybridization.

For purposes of illustration, the present invention is described as being part of a computer system that designs a
chip mask, synthesizes the probes on the chip, labels the nucleic acids, and scans the hybridized nucleic acid probes.
Such a system is fully described in U.S. Patent Application No. 08/249,188 which has been incorporated by reference
for all purposes. However, the present invention may be used separately from the overall system for analyzing data
generated by such systems.

Fig. 3 illustrates a computerized system for forming and analyzing arrays of biological materials such as RNA or
DNA. A computer 100 is used to design arrays of biological polymers such as RNA or DNA. The computer 100 may be,
for example, an appropriately programmed Sun Workstation or personal computer or workstation, such as an IBM PC
equivalent, including appropriate memory and a CPU as shown in Figs. 1 and 2. The computer system 100 obtains
inputs from a user regarding characteristics of a gene of interest, and other inputs regarding the desired features of the
array. Optionally, the computer system may obtain information regarding a specific genetic sequence of interest from an
external or internal database 102 such as GenBank. The output of the computer system 100 is a set of chip design
computer files 104 in the form of, for example, a switch matrix, as described in PCT application WO 92/10092, and other
associated computer files.

The chip design files are provided to a system 106 that designs the lithographic masks used in the fabrication of
arrays of molecules such as DNA. The system or process 106 may include the hardware necessary to manufacture
masks 110 and also the computer hardware and software 108 necessary to lay the mask patterns out on the
mask in an efficient manner. As with the other features in Fig. 3, such equipment may or may not be located at the same
physical site, but is shown together for ease of illustration in Fig. 3. The system 106 generates masks 110 or other
synthesis patterns such as chrome-on-glass masks for use in the fabrication of polymer arrays.

The masks 110, as well as selected information relating to the design of the chips from system 100, are used in a
synthesis system 112. Synthesis system 112 includes the necessary hardware and software used to fabricate arrays of
polymers on a substrate or chip 114. For example, synthesizer 112 includes a light source 116 and a chemical flow cell
118 on which the substrate or chip 114 is placed. Mask 110 is placed between the light source and the substrate/chip,
and the two are translated relative to each other at appropriate times for deprotection of selected regions of the chip.
Selected chemical reagents are directed through flow cell 118 for coupling to deprotected regions, as well as for washing
and other operations. All operations are preferably directed by an appropriately programmed computer 119, which may
or may not be the same computer as the computer(s) used in mask design and mask making.

The substrates fabricated by synthesis system 112 are optionally diced into smaller chips and exposed to marked
receptors. The receptors may or may not be complementary to one or more of the molecules on the substrate. The
receptors are marked with a label such as a fluorescein label (indicated by an asterisk in Fig. 3) and placed in scanning
system 120. Scanning system 120 again operates under the direction of an appropriately programmed digital computer
122, which also may or may not be the same computer as the computers used in synthesis, mask making, and mask
design. The scanner 120 includes a detection device 124 such as a confocal microscope or CCD (charge-coupled
device) that is used to detect the locations where labeled receptor (*) has bound to the substrate. The output of scanner
120 is an image file(s) 124 indicating, in the case of fluorescein labeled receptor, the fluorescence intensity (photon
counts or other related measurements, such as voltage) as a function of position on the substrate. Since higher photon
counts will be observed where the labeled receptor has bound more strongly to the array of polymers, and since the
monomer sequence of the polymers on the substrate is known as a function of position, it becomes possible to determine
the sequence(s) of polymer(s) on the substrate that are complementary to the receptor.

The image file 124 is provided as input to an analysis system 126 that incorporates the visualization and analysis
methods of the present invention. Again, the analysis system may be any one of a wide variety of computer system(s),
but in a preferred embodiment the analysis system is based on a Sun Workstation or equivalent. The present invention
provides various methods of analyzing the chip design files and the image files, providing appropriate output 128. The
present invention may further be used to identify specific mutations in a receptor such as DNA or RNA.

Fig. 4 provides a simplified illustration of the overall software system used in the operation of one embodiment of
the invention. As shown in Fig. 4, in some cases (such as sequence checking systems) the system first identifies the
genetic sequence(s) or targets that would be of interest in a particular analysis at step 202. The sequences of interest
may, for example, be normal or mutant portions of a gene, genes that identify heredity, or provide forensic information,
or be all possible n-mers (where n represents the length of the nucleic acid). Sequence selection may be provided via
manual input of text files or may be from external sources such as GenBank. At step 204 the system evaluates the gene
to determine or assist the user in determining which probes would be desirable on the chip, and provides an appropriate
"layout" on the chip for the probes. The chip usually includes probes that are complementary to a reference nucleic acid
sequence which has a known sequence. A wild-type probe is a probe that will ideally hybridize with the reference
sequence and thus a wild-type gene (also called the chip wild-type) would ideally hybridize with wild-type probes on the
chip. The target sequence is substantially similar to the reference sequence except for the presence of mutations, insertions,
deletions, and the like. The layout implements desired characteristics such as arrangement on the chip that permits
"reading" of genetic sequence and/or minimization of edge effects, ease of synthesis, and the like.

Fig. 5 illustrates the global layout of a chip in a particular embodiment used for sequence checking applications.
Chip 114 is composed of multiple units where each unit may contain different tilings for the chip wild-type sequence.
Unit 1 is shown in greater detail and shows that each unit is composed of multiple cells which are areas on the chip that
may contain probes. Conceptually, each unit is composed of multiple sets of related cells. As used herein, the term cell
refers to a region on a substrate that contains many copies of a molecule or molecules of interest. Each unit is composed
of multiple cells that may be placed in rows (or "lanes") and columns. In one embodiment, a set of five related cells
includes the following: a wild-type cell 220, "mutation" cells 222, and a "blank" cell 224. Cell 220 contains a wild-type
probe that is the complement of a portion of the wild-type sequence. Cells 222 contain "mutation" probes for the wild-type
sequence. For example, if the wild-type probe is 3'-ACGT, the probes 3'-ACAT, 3'-ACCT, 3'-ACGT, and 3'-ACTT
may be the "mutation" probes. Cell 224 is the "blank" cell because it contains no probes (also called the "blank" probe).
As the blank cell contains no probes, labeled receptors should not bind to the chip in this area. Thus, the blank cell
provides an area that can be used to measure the background intensity.

In one embodiment, numerous tiling processes are available including sequence tiling, block tiling, and opt-tiling,
as described below. Of course a wide range of layout strategies may be used according to the invention herein, without
departing from the scope of the invention. For example, the probes may be tiled on a substrate in an apparently random
fashion where a computer system is utilized to keep track of the probe locations and correlate the data obtained from
the substrate.

Opt-tiling is the process of tiling additional probes for suspected mutations. As a simple example of opt-tiling, suppose
the wild-type target sequence is 5'-ACGTATGCA-3' and it is suspected that a mutant sequence has a possible T base
mutation at the underlined base position (the fifth base, A, as the probes in Table 1 show). Suppose further that the chip
will be synthesized with a "4x3" tiling strategy, meaning that probes of four monomers are used and that the monomers
in position 3, counting left to right, of the probe are varied.

In opt-tiling, extra probes are tiled for each suspected mutation. The extra probes are tiled as if the mutation base
is a wild-type base. The following shows the probes that may be generated for this example:



Table 1

Probe Sequences (From 3'-end), 4x3 Opt-Tiling

Wild     TGCA   GCAT   CATA   ATAC   TACG
A sub.   TGAA   GCAT   CAAA   ATAC   TAAG
C sub.   TGCA   GCCT   CACA   ATCC   TACG
G sub.   TGGA   GCGT   CAGA   ATGC   TAGG
T sub.   TGTA   GCTT   CATA   ATTC   TATG

Wild     TGCA   GCAA   CAAA   AAAC   AACG
A sub.   TGAA   GCAA   CAAA   AAAC   AAAG
C sub.   TGCA   GCCA   CACA   AACC   AACG
G sub.   TGGA   GCGA   CAGA   AAGC   AAGG
T sub.   TGTA   GCTA   CATA   AATC   AATG



In the first "chip" above, the top row of the probes (along with one probe below each of the four wild-type probes) should
bind to the target DNA sequence. However, if the target sequence has a T base mutation as suspected, the labeled
mutant sequence will not bind that strongly to the probes in the columns around column 3. For example, the mutant
receptor that could bind with the probes in column 2 is 5'-CGTT which may not bind that strongly to any of the probes
in column 2 because there are T bases at the ends of the receptor and probes (i.e., not complementary). This often
results in a relatively dark scanned area around a mutation.

Opt-tiling generates the second "chip" above to handle the suspected mutation as a wild-type base. Thus, the mutant
receptor 5'-CGTT should bind strongly to the wild-type probe of column 2 (along with one probe below) and the mutation
can be further detected.
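
The 4x3 tiling and opt-tiling described above can be reproduced with a short Python sketch. The helper names `complement` and `tile` are hypothetical, and the mutant target `ACGTTTGCA` is an assumption inferred from the second set of probes in Table 1 (an A to T change at the fifth base); the specification does not give code for this step.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq_5_to_3: str) -> str:
    """Base-by-base complement; the result reads 3'->5' from left to right."""
    return "".join(COMPLEMENT[base] for base in seq_5_to_3)


def tile(target_5_to_3: str, probe_len: int = 4, sub_pos: int = 3) -> dict:
    """Return {row label: list of probes} for a probe_len x sub_pos tiling.

    Probes are written from the 3'-end, as in Table 1. The "Wild" row holds
    the perfect complements of successive windows of the target; each other
    row substitutes one base at `sub_pos` (1-based, counting left to right).
    """
    comp = complement(target_5_to_3)
    wild = [comp[i:i + probe_len] for i in range(len(comp) - probe_len + 1)]
    rows = {"Wild": wild}
    for base in "ACGT":
        rows[f"{base} sub."] = [
            probe[:sub_pos - 1] + base + probe[sub_pos:] for probe in wild
        ]
    return rows


wild_target = "ACGTATGCA"
mutant_target = "ACGTTTGCA"  # assumed mutant: A -> T at the fifth base

for label, target in (("wild-type tiling", wild_target),
                      ("opt-tiling for the suspected mutation", mutant_target)):
    print(label)
    for row, probes in tile(target).items():
        # Table 1 lists the first five columns only.
        print(f"  {row:7s}", " ".join(probes[:5]))
```

Tiling the assumed mutant sequence as if it were wild-type is what yields the second set of probes, so the mutant receptor has a perfectly complementary probe in each column.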

Again referring to Fig. 4, at step 206 the masks for the synthesis are designed. At step 208 the software utilizes the
mask design and layout information to make the DNA or other polymer chips. This software 208 will control relative
translation of a substrate and the mask, the flow of desired reagents through a flow cell, the synthesis temperature of
the flow cell, and other parameters. At step 210, another piece of software is used in scanning a chip thus synthesized
and exposed to a labeled receptor. The software controls the scanning of the chip, and stores the data thus obtained in
a file that may later be utilized to extract sequence information.

At step 212 a computer system according to the present invention utilizes the layout information and the fluorescence
information to evaluate the hybridized nucleic acid probes on the chip. Among the important pieces of information
obtained from probe arrays are the identification of mutant receptors and determination of genetic sequence of a particular
receptor.

Fig. 6 illustrates the binding of a particular target DNA to an array of DNA probes 114. As shown in this simple
example, the following probes are formed in the array (only one probe is shown for the wild-type probe):

3'-AGAACGT
   AGACCGT
   AGAGCGT
   AGATCGT

As shown, the set of probes differ by only one base so the probes are designed to determine the identity of the base at
that position in the nucleic acid sequence.

When a fluorescein-labeled (or otherwise marked) target with the sequence 5'-TCTTGCA is exposed to the array,
it is complementary only to the probe 3'-AGAACGT, and fluorescein will be primarily found on the surface of the chip
where 3'-AGAACGT is located. Thus, for each set of probes that differ by only one base, the image file will contain four
fluorescence intensities, one for each probe. Each fluorescence intensity can therefore be associated with the base of
each probe that is different from the other probes. Additionally, the image file will contain a "blank" cell which can be
used as the fluorescence intensity of the background. By analyzing the five fluorescence intensities associated with a
specific base location, it becomes possible to extract sequence information from such arrays using the methods of the
invention disclosed herein.
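
The complementarity in the Fig. 6 example can be checked with a few lines of Python. This is an illustrative sketch only; the names `complement`, `probes_3_to_5`, and `varied_position` are hypothetical and do not come from the specification.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq_5_to_3: str) -> str:
    """Base-by-base complement; the result reads 3'->5' from left to right."""
    return "".join(COMPLEMENT[base] for base in seq_5_to_3)


# The four probes of the example, written from the 3'-end.
probes_3_to_5 = ["AGAACGT", "AGACCGT", "AGAGCGT", "AGATCGT"]
target_5_to_3 = "TCTTGCA"

# The only probe that should hybridize strongly is the exact complement.
perfect_match = complement(target_5_to_3)
matches = [probe for probe in probes_3_to_5 if probe == perfect_match]

# Index (0-based) of the single position at which the probes differ.
varied_position = next(
    i for i in range(len(probes_3_to_5[0]))
    if len({probe[i] for probe in probes_3_to_5}) > 1
)

print(perfect_match)    # AGAACGT
print(matches)          # ['AGAACGT']
print(varied_position)  # 3
print(target_5_to_3[varied_position])  # T, the target base being interrogated
```

The probe base at the varied position (A) is the complement of the target base at that position (T), which is why each of the four intensities can be associated with one candidate base.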

Fig. 7 illustrates probes arranged in lanes on a chip. A reference sequence is shown with five interrogation positions
marked with number subscripts. An interrogation position is a base position in the reference sequence where the target
sequence may contain a mutation or otherwise differ from the reference sequence. The chip may contain five probe cells
that correspond to each interrogation position. Each probe cell contains a set of probes that have a common base at
the interrogation position. For example, at the first interrogation position, I1, the reference sequence has a base T. The
wild-type probe for this interrogation position is 3'-TGAC where the base A in the probe is complementary to the base
at the interrogation position in the reference sequence.

Similarly, there are four "mutant" probe cells for the first interrogation position, I1. The four mutant probes are 3'-TGAC,
3'-TGCC, 3'-TGGC, and 3'-TGTC. Each of the four mutant probes varies by a single base at the interrogation
position. As shown, the wild-type and mutant probes are arranged in lanes on the chip. One of the mutant probes (in
this case 3'-TGAC) is identical to the wild-type probe and therefore does not evidence a mutation. However, the redundancy
gives a visual indication of mutations as will be seen in Fig. 8.

Still referring to Fig. 7, the chip contains wild-type and mutant probes for each of the other interrogation positions
I2-I5. In each case, the wild-type probe is equivalent to one of the mutant probes.

Fig. 8 illustrates a hybridization pattern of a target on a chip with a reference sequence as in Fig. 7. The reference
sequence is shown along the top of the chip for comparison. The chip includes a WT-lane (wild-type), an A-lane, a C-lane,
a G-lane, and a T-lane (or U). Each lane is a row of cells containing probes. The cells in the WT-lane contain probes
that are complementary to the reference sequence. The cells in the A-, C-, G-, and T-lanes contain probes that are
complementary to the reference sequence except that the named base is at the interrogation position.

In one embodiment, the hybridization of probes in a cell is determined by the fluorescent intensity (e.g., photon
counts) of the cell resulting from the binding of marked target sequences. The fluorescent intensity may vary greatly
among cells. For simplicity, Fig. 8 shows a high degree of hybridization by a cell containing a darkened area. The WT-lane
allows a simple visual indication that there is a mutation at interrogation position I4 because the wild-type cell is not
dark at that position. The cell in the C-lane is darkened, which indicates that the mutation is from T->G (mutant probe

In practice, the fluorescent intensities of cells near an interrogation position having a mutation are relatively dark,
creating "dark regions" around a mutation. The lower fluorescent intensities result because the cells at interrogation
positions near a mutation do not contain probes that are perfectly complementary to the target sequence; thus, the
hybridization of these probes with the target sequence is lower. For example, the relative intensity of the cells at interrogation
positions I3 and I5 may be relatively low because none of the probes therein are complementary to the target
sequence.

For ease of reference, one may call bases by assigning the bases the following codes:

Code   Group              Meaning
A      A                  Adenine
C      C                  Cytosine
G      G                  Guanine
T      T(U)               Thymine (Uracil)
M      A or C             aMino
R      A or G             puRine
W      A or T(U)          Weak interaction (2 H bonds)
Y      C or T(U)          pYrimidine
S      C or G             Strong interaction (3 H bonds)
K      G or T(U)          Keto
V      A, C or G          not T(U)
H      A, C or T(U)       not G
D      A, G or T(U)       not C
B      C, G or T(U)       not A
N      A, C, G, or T(U)   Insufficient intensity to call
X      A, C, G, or T(U)   Insufficient discrimination to call

Most of the codes conform to the IUPAC standard. However, code N has been redefined and code X has been added.
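
As an illustration, the code table above can be represented as a simple lookup in Python. This is a sketch under assumptions: the names `GROUPS` and `code_for` and the `low_intensity` flag are hypothetical, introduced only to distinguish the two four-base codes N and X as the table defines them.

```python
from typing import Iterable

# Code -> group of bases, taken from the table above (T also stands for U).
GROUPS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "Y": "CT", "S": "CG", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
}

# Reverse lookup: set of bases -> code.
_BY_GROUP = {frozenset(group): code for code, group in GROUPS.items()}


def code_for(bases: Iterable[str], low_intensity: bool = False) -> str:
    """Return the code for a set of candidate bases.

    When all four bases remain possible, return "N" if the intensity was
    insufficient to call and "X" if the discrimination was insufficient.
    """
    group = frozenset(bases)
    if not group or not group <= frozenset("ACGT"):
        raise ValueError("bases must be a non-empty subset of A, C, G, T")
    if len(group) == 4:
        return "N" if low_intensity else "X"
    return _BY_GROUP[group]


print(code_for("AG"))                        # R
print(code_for({"C", "G", "T"}))             # B
print(code_for("ACGT"))                      # X
print(code_for("ACGT", low_intensity=True))  # N
```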
II. Intensity Ratio Method 

The intensity ratio method is a method of calling bases in a sample nucleic acid sequence. The intensity ratio method
is most accurate when there is good discrimination between the fluorescence intensities of hybrid matches and hybrid
mismatches. If there is insufficient discrimination, the intensity ratio method assigns a corresponding ambiguity code to
the unknown base.

For simplicity, the intensity ratio method will be described as being used to identify one unknown base in a sample
nucleic acid sequence. In practice, the method is used to identify many or all the bases in a nucleic acid sequence.

The unknown base will be identified by evaluation of up to four mutation probes and a "blank" cell, which is a location
where a labeled receptor should not bind to the chip since no probe is present. For example, suppose a DNA sequence
of interest or target sequence contains the sequence 5'-AGAACCTGC-3' with a possible mutation at the underlined base
position (the first C, at position 5). Suppose that 5-mer probes are to be synthesized for the target sequence. A
representative wild-type probe of 3'-TTGGA is complementary to the region of the sequence around the possible mutation.
The "mutation" probes will be the same as the wild-type probe except for a different base at the third position as
follows: 3'-TTAGA, 3'-TTCGA, 3'-TTGGA, and 3'-TTTGA.

If the fluorescently marked sample sequence is exposed to the above four mutation probes, the intensity should be
highest for the probe that binds most strongly to the sample sequence. Therefore, if the probe 3'-TTTGA shows the
highest intensity, the unknown base in the sample will generally be called an A mutation because the probes are
complementary to the sample sequence.

The mutation probes are identical to the wild-type probes except that they each contain one of the four A, C, G, or
T "mutations" for the unknown base. Although one of the "mutation" probes will be identical to the wild-type probe, such
redundant probes are intentionally synthesized for quality control and design consistency.

The identity of the unknown base is preferably determined by evaluating the relative fluorescence intensities of up
to four of the mutation probes, and the "blank" cell. Because each mutation probe is identifiable by the mutation base,
a mutation probe's intensity will be referred to as the "base intensity" of the mutation base.
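
The construction of a wild-type probe and its four "mutation" probes can be sketched as follows. This is an illustrative sketch, not the tiling procedure of the specification: the function name `mutation_probes` and its parameters are hypothetical, the target is assumed to be written 5' to 3', and the probes are written 3' to 5' so that each probe base sits directly under the target base it pairs with.

```python
# Watson-Crick complements for DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def mutation_probes(target, position, length, offset):
    """Return (wild_type_probe, probes_by_mutation_base).

    target   -- target sequence written 5'->3'
    position -- 0-based index of the interrogated base in the target
    length   -- probe length
    offset   -- 0-based index of the interrogated base within the probe

    Probes are written 3'->5', so no reversal is needed: each probe base
    is simply the complement of the target base at the same column.
    """
    start = position - offset
    window = target[start:start + length]
    wild_type = "".join(COMPLEMENT[base] for base in window)
    probes = {}
    for base in "ACGT":
        # Substitute each of the four bases at the interrogation position.
        probes[base] = wild_type[:offset] + base + wild_type[offset + 1:]
    return wild_type, probes
```

With the target 5'-AGAACCTGC-3', `mutation_probes("AGAACCTGC", 4, 5, 2)` yields the wild-type probe TTGGA and the probes TTAGA, TTCGA, TTGGA, and TTTGA, keyed by their mutation base.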

As a simple example of the intensity ratio method, suppose a gene of interest (target) is an HIV protease gene with
the sequence 5'-ATGTGGACAGTTGTA-3' (SEQ ID NO:1). Suppose further that a sample sequence is suspected to
have the same sequence as the target sequence except for a mutation of base C to base T at the underlined base
position. Although hundreds of probes may be synthesized on the chip, the complementary mutation probes synthesized
to detect a mutation in the sample sequence at the suspected mutation position may be as follows:

3'-TATC
3'-TCTC
3'-TGTC (wild-type)
3'-TTTC

The mutation probe 3'-TGTC is also the wild-type probe as it should bind most strongly with the target sequence.

After the sample sequence is labeled, hybridized on the chip, and scanned, suppose the following fluorescence
intensities were obtained:

3'-TATC -> 45
3'-TCTC -> 8
3'-TGTC -> 32
3'-TTTC -> 12

where the intensity is measured by the photon count detected by the scanner. The "blank" cell had a fluorescence
intensity of 2. The photon counts in the examples herein are representative (not actual data) and provided for illustration
purposes. In practice, the actual photon counts will vary greatly depending on the experiment parameters and the scanner
utilized.

Although each fluorescence intensity is from a probe, the probes may be characterized by their unique mutation
base so the bases may be said to have the following intensities:

A -> 45
C -> 8
G -> 32
T -> 12

Thus, base A will be described as having an intensity of 45, which corresponds to the intensity of the mutation probe
with the mutation base A.

Initially, each mutation base intensity is reduced by the background or "blank" cell intensity. This is done as follows:

A -> 45 - 2 = 43
C -> 8 - 2 = 6
G -> 32 - 2 = 30
T -> 12 - 2 = 10

Then, the base intensities are sorted in descending order of intensity. The above bases would be sorted as follows:

A -> 43
G -> 30
T -> 10
C -> 6

Next, the highest intensity base is compared to the second highest intensity base. Thus, the ratio of the intensity of base
A to the intensity of base G is calculated as follows: A:G = 43 / 30 = 1.4. The ratio A:G is then compared to a predetermined
ratio cutoff, which is a number that specifies the ratio required to identify the unknown base. For example, if the ratio
cutoff is 1.2, the ratio A:G is greater than the ratio cutoff (1.4 > 1.2) and the unknown base is called by the mutation
probe containing the mutation A. As probes are complementary to the sample sequence, the sample sequence is called
as having a mutation T, resulting in a called sample sequence of 5'-ATGTGGATAGTTGTA-3' (SEQ ID NO:2).

As another example, suppose everything else is the same as in the previous example except that the sorted
background adjusted intensities were as follows:

C -> 42
A -> 40
G -> 10
T -> 8

The ratio of the highest intensity base to the second highest intensity base (C:A) is 1.05. Because this ratio is not greater
than the ratio cutoff of 1.2, the unknown base will be called as being ambiguously one of two or more bases as follows.
The second highest intensity base is then compared to the third highest base. The ratio of A:G is 4. The ratio of A:G
is then compared to the ratio cutoff of 1.2. As the ratio A:G is greater than the ratio cutoff (4 > 1.2), the unknown base
is called by the mutation probes containing the mutations C or A. As probes are complementary to the sample sequence,
the sample sequence is called as having either a mutation G or T, resulting in a sample sequence of
5'-ATGTGGAKAGTTGTA-3' (SEQ ID NO:3), where K is the IUPAC code for G or T(U).

The ratio cutoff in the previous examples was equal to 1.2. However, the ratio cutoff will generally need to be adjusted
to produce optimal results for the specific chip design and wild-type target. Also, although the ratio cutoff used has been
the same for each ratio comparison, the ratio cutoff may vary depending on whether the ratio comparisons involve the
highest, second highest, third highest, etc. intensity base.

Fig. 9 illustrates the high level flow of the intensity ratio method. At step 302 the four base intensities are adjusted
by subtracting the background or "blank" cell intensity from each base intensity. Preferably, if a base intensity is then
less than or equal to zero, the base intensity is set equal to a small positive number to prevent division by zero or negative
numbers in future calculations.

At step 304 the base intensities are sorted by intensity. Each base is then associated with a number from 1 to 4.
The base with the highest intensity is 1, second highest 2, third highest 3, and fourth highest 4. Thus, the intensity of
base 1 ≥ base 2 ≥ base 3 ≥ base 4.

At step 306 the highest intensity base (base 1) is checked to see if it has sufficient intensity to call the unknown
base. The intensity is checked by determining if the intensity of base 1 is greater than a predetermined background
difference cutoff. The background difference cutoff is a number that specifies the intensity a base intensity must be over
the background intensity in order to correctly call the unknown base. Thus, the background adjusted base intensity must
be greater than the background difference cutoff or the unknown base is not callable.

If the intensity of base 1 is not greater than the background difference cutoff, the unknown base is assigned the
code N (insufficient intensity) as shown at step 308. Otherwise, the ratio of the intensity of base 1 to base 2 is calculated
as shown at step 310.

At step 312 the ratio of intensity of bases 1:2 is compared to the ratio cutoff. If the ratio 1:2 is greater than the ratio
cutoff, the unknown base is called as the complement of the highest intensity base (base 1) as shown at step 314.
Otherwise, the ratio of the intensity of base 2 to base 3 is calculated as shown at step 316.

At step 318 the ratio of intensity of bases 2:3 is compared to the ratio cutoff. If the ratio 2:3 is greater than the ratio
cutoff, the unknown base is called as being an ambiguity code specifying the complements of the highest or second
highest intensity bases (base 1 or 2) as shown at step 320. Otherwise, the ratio of the intensity of base 3 to base 4 is
calculated as shown at step 322.

At step 324 the ratio of intensity of bases 3:4 is compared to the ratio cutoff. If the ratio 3:4 is greater than the ratio
cutoff, the unknown base is called as being an ambiguity code specifying the complements of the highest, second
highest, or third highest bases (base 1, 2 or 3) as shown at step 326. Otherwise, the unknown base is assigned the
code X (insufficient discrimination) as shown at step 328.
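
The flow of steps 302 through 328 can be sketched in Python as shown here. This is a hedged illustration rather than the implementation of the specification: the function name `call_base`, the `floor` value used as the "small positive number", and the default background difference cutoff of 5 are assumptions chosen for the example (the ratio cutoff of 1.2 is the value used in the worked examples).

```python
# Probes are complementary to the sample, so the called base is the
# complement of the probe's mutation base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Ambiguity codes for sets of one to three candidate bases (keys sorted).
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CT": "Y", "CG": "S", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B",
}


def call_base(intensities, blank, ratio_cutoff=1.2,
              background_cutoff=5.0, floor=1e-6):
    """Call one unknown base from four mutation-probe intensities."""
    # Step 302: subtract the blank cell; clamp non-positive results.
    adjusted = {}
    for base, value in intensities.items():
        diff = value - blank
        adjusted[base] = diff if diff > 0 else floor

    # Step 304: rank bases from highest to lowest adjusted intensity.
    ranked = sorted(adjusted, key=adjusted.get, reverse=True)

    # Steps 306-308: the brightest base must clear the background cutoff.
    if adjusted[ranked[0]] <= background_cutoff:
        return "N"

    # Steps 310-326: compare ranks 1:2, then 2:3, then 3:4.
    for k in range(1, 4):
        ratio = adjusted[ranked[k - 1]] / adjusted[ranked[k]]
        if ratio > ratio_cutoff:
            called = sorted(COMPLEMENT[b] for b in ranked[:k])
            return AMBIGUITY["".join(called)]

    # Step 328: no adjacent pair is separated by the ratio cutoff.
    return "X"
```

Applied to the first worked example, `call_base({"A": 45, "C": 8, "G": 32, "T": 12}, blank=2)` returns T; with background-adjusted intensities of C 42, A 40, G 10, and T 8 it returns K.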

The advantage of the intensity ratio method is that it is very accurate when there is good discrimination between
the fluorescence intensities of hybrid matches and hybrid mismatches. However, if the base corresponding to a correct
hybrid gives a lower intensity than a mismatch (e.g., as a result of cross-hybridization), incorrect identification of the
base will result. For this reason, however, the method is useful for comparative assessment of hybridization quality and
as an indicator of sequence-specific problem spots. For example, the intensity ratio method has been used to determine
that ambiguities and miscalls tend to be very different from sequence to sequence, and reflect predominantly the
composition and repetitiveness of the sequence. It has also been used to assess improvements obtained by varying
hybridization conditions, sample preparation, and post-hybridization treatments (e.g., RNase treatment).

III. Reference Method

The reference method is a method of calling bases in a sample nucleic acid sequence. The reference method
depends very little on discrimination between the fluorescence intensities of hybrid matches and hybrid mismatches,
and therefore is much less sensitive to cross-hybridization. The method compares the probe intensities of a reference
sequence to the probe intensities of a sample sequence. Any significant changes are flagged as possible mutations.
There are two implementations of the reference method disclosed herein.

For simplicity, the reference method will be described as being used to identify one unknown base in a sample
nucleic acid sequence. In practice, the method is used to identify many or all the bases in a nucleic acid sequence.

The unknown base will be called by comparing the probe intensities of a reference sequence to the probe intensities
of a sample sequence. Preferably, the probe intensities of the reference sequence and the sample sequence are from
chips having the same chip wild-type. However, the reference sequence may or may not be exactly the same as the chip
wild-type, as it may have mutations.

The bases at the same position in the reference and sample sequences will each be associated with up to four
mutation probes and a "blank" cell. The unknown base in the sample sequence is called by comparing probe intensities
of the sample sequence to probe intensities of the reference sequence. For example, suppose the chip wild-type contains
the sequence 5'-AGACCTTGC-3' and it is suspected that the sample has a possible mutation at the underlined base
position (the second C, at position 5), which is the unknown base that will be called by the reference method. The
"mutation" probes for the sample sequence may be as follows: 3'-GAAA, 3'-GCAA, 3'-GGAA, and 3'-GTAA, where 3'-GGAA
is the wild-type probe.

Suppose further that a reference sequence, which differs from the chip wild-type by one base mutation, has the
sequence 5'-AGACATTGC-3' where the mutation base is underlined (the A at position 5). The "mutation" probes for the
reference sequence may be as follows: 3'-TGAAA, 3'-TGCAA, 3'-TGGAA, and 3'-TGTAA, where 3'-TGTAA is the reference
wild-type probe since the reference sequence is known. Although generally the sample and reference sequences were
tiled with the same chip wild-type, this is not required, and the tiling methods do not have to be identical as shown by
the use of two probe lengths in the example. Thus, the unknown base will be called by comparing the "mutation" probes
of the sample sequence to the "mutation" probes of the reference sequence. As before, because each mutation probe is
identifiable by the mutation base, the mutation probes' intensities will be referred to as the "base intensities" of their
respective mutation bases.

As a simple example of one implementation of the reference method, suppose a gene of interest (target) has the
sequence 5'-AAAACTGAAAA-3' (SEQ ID NO:4). Suppose a reference sequence has the sequence 5'-AAAACCGAAAA-3'
(SEQ ID NO:5), which differs from the target sequence by the underlined base (position 6). The reference sequence is
marked and exposed to probes on a chip with the target sequence being the chip wild-type. Suppose further that a sample
sequence is suspected to have the same sequence as the target sequence except for a mutation at the underlined base
position in 5'-AAAACTGAAAA-3' (SEQ ID NO:4). The sample sequence is also marked and exposed to probes on a
chip with the target sequence being the chip wild-type. After hybridization and scanning, the following probe intensities
(not actual data) were found for the respective complementary probes:

Reference          Sample
3'-TGAC -> 12      3'-GACT -> 11
3'-TGCC -> 9       3'-GCCT -> 30
3'-TGGC -> 80      3'-GGCT -> 60
3'-TGTC -> 15      3'-GTCT -> 6

Although each fluorescence intensity is from a probe, the probes may be identified by their unique mutation base so the
bases may be said to have the following intensities:

Reference     Sample
A -> 12       A -> 11
C -> 9        C -> 30
G -> 80       G -> 60
T -> 15       T -> 6

Thus, base A of the reference sequence will be described as having an intensity of 12, which corresponds to the intensity
of the mutation probe with the mutation base A. The reference method will now be described as calling the unknown
base in the sample sequence by using these intensities.

Fig. 10A illustrates the high level flow of one implementation of the reference method. For illustration purposes, the
reference method is described as filling in the columns (identified by the numbers along the bottom) of the analysis table
shown in Fig. 10B. However, the generation of an analysis table is not necessary to practice the method. The analysis
table is shown to aid the reader in understanding the method.

At step 402 the four base intensities of the reference and sample sequences are adjusted by subtracting the
background or "blank" cell intensity from each base intensity. Each set of "mutation" probes has an associated "blank" cell.
Suppose that the reference "blank" cell intensity is 1 and the sample "blank" cell intensity is 2. The base intensities are
then background subtracted as follows:

Reference               Sample
A -> 12 - 1 = 11        A -> 11 - 2 = 9
C -> 9 - 1 = 8          C -> 30 - 2 = 28
G -> 80 - 1 = 79        G -> 60 - 2 = 58
T -> 15 - 1 = 14        T -> 6 - 2 = 4

Preferably, if a base intensity is then less than or equal to zero, the base intensity is set equal to a small positive number
to prevent division by zero or negative numbers in future calculations.

For identification, the position of each base of interest in the reference and sample sequences is placed in column
1 of the analysis table. Also, since the reference sequence is a known sequence, the base at this position is known and
is referred to as the reference wild-type. The reference wild-type is placed in column 2 of the analysis table, which is C
for this example.

At step 404 the base intensity associated with the reference wild-type (column 2 of the analysis table) is checked
to see if it has sufficient intensity to call the unknown base. In this example, the reference wild-type is C. However, the
base intensity associated with the wild-type is the G base intensity, which is 79 in this example. This is because the base
intensities actually represent the complementary "mutation" probes. The G base intensity is checked by determining if
its intensity is greater than a predetermined background difference cutoff. The background difference cutoff is a number
that specifies the intensity the base intensities must be above the background intensity in order to correctly call the
unknown base. Thus, the base intensity associated with the reference wild-type must be greater than the background
difference cutoff or the unknown base is not callable.

If the background difference cutoff is 5, the base intensity associated with the reference wild-type has sufficient
intensity (79 > 5) so a P (pass) is placed in column 3 of the analysis table as shown at step 406. Otherwise, at step 407
an F (fail) is placed in column 3 of the analysis table.

At step 408 the ratios of the base intensity associated with the reference wild-type to each of the possible bases are
calculated. The ratio of the base intensity associated with the reference wild-type to itself will be 1 and the other ratios
will usually be greater than 1. The base intensity associated with the reference wild-type is G so the following ratios are
calculated:

G:A -> 79 / 11 = 7.2
G:C -> 79 / 8 = 9.9
G:G -> 79 / 79 = 1.0
G:T -> 79 / 14 = 5.6

These ratios are placed in columns 4 through 7 of the analysis table, respectively.

At step 410 the highest base intensity associated with the sample sequence is checked to see if it has sufficient
intensity to call the unknown base. The highest base intensity is checked by determining if the intensity is greater than
the background difference cutoff. Thus, the highest base intensity must be greater than the background difference cutoff
or the unknown base is not callable.

Again, if the background difference cutoff is 5, the highest base intensity, which is G in this example, has sufficient
intensity (58 > 5) so a P (pass) is placed in column 8 of the analysis table as shown at step 412. Otherwise, at step 413
an F (fail) is placed in column 8 of the analysis table.

At step 414 the ratios of the highest base intensity of the sample to each of the possible bases are calculated. The
ratio of the highest base intensity to itself will be 1 and the other ratios will usually be greater than 1. Thus, the highest
base intensity is G so the following ratios are calculated:

G:A -> 58 / 9 = 6.4
G:C -> 58 / 28 = 2.3
G:G -> 58 / 58 = 1.0
G:T -> 58 / 4 = 14.5

These ratios are placed in columns 9 through 12 of the analysis table, respectively.

At step 416 if both the reference and sample sequence probes failed to have sufficient intensity to call the unknown
base, meaning there is an 'F' in columns 3 and 8 of the analysis table, the unknown base is assigned the code N
(insufficient intensity) as shown at step 418. An 'N' is placed in column 17 of the analysis table. Additionally, a confidence
code of 9 is placed in column 18 of the analysis table where the confidence codes have the following meanings:



Code   Meaning
0      Probable reference wild-type
1      Probable mutation
2      Reference sufficient intensity, insufficient intensity in sample suggests possible mutation
3      Borderline differences, unknown base ambiguous
4      Sample sufficient intensity, insufficient intensity in reference to allow comparison
5-8    Currently unassigned
9      Insufficient intensity in reference and sample, no interpretation possible



The confidence codes are useful for indicating to the user the resulting analysis of the reference method.

At step 420 if only the reference sequence probes failed to have sufficient intensity to call the unknown base, meaning
there is an 'F' in column 3 and a 'P' in column 8 of the analysis table, the unknown base is assigned the code N (insufficient
intensity) as shown at step 422. An 'N' is placed in column 17 and a confidence code of 4 is placed in column 18 of the
analysis table.

At step 424 if only the sample sequence probes failed to have sufficient intensity to call the unknown base, meaning
there is a 'P' in column 3 and an 'F' in column 8 of the analysis table, the unknown base is assigned the code N (insufficient
intensity) as shown at step 426. An 'N' is placed in column 17 and a confidence code of 2 is placed in column 18 of the
analysis table.

In this example, both the reference and sample sequence probes have sufficient intensity to call the unknown base.
At step 428 the ratios of the reference ratios to the sample ratios for each base type are calculated. Thus, the ratio A:A
(column 4 to column 9) is placed in column 13 of the analysis table. The ratio C:C (column 5 to column 10) is placed in
column 14 of the analysis table. The ratio G:G (column 6 to column 11) is placed in column 15 of the analysis table.
Lastly, the ratio T:T (column 7 to column 12) is placed in column 16 of the analysis table. These ratios are calculated as
follows:

A:A -> 7.2 / 6.4 = 1.1
C:C -> 9.9 / 2.3 = 4.3
G:G -> 1.0 / 1.0 = 1.0
T:T -> 5.6 / 14.5 = 0.4

The unknown base is called by comparing these ratios of ratios to two predetermined values as follows.

At step 430 if all the ratios of ratios (columns 13 to 16 of the analysis table) are less than a predetermined lower
ratio cutoff, the unknown base is assigned the code of the reference wild-type as shown at step 432. Thus, the code for
the reference wild-type (as shown in column 2) would be placed in column 17 and a confidence code of 0 would be
placed in column 18 of the analysis table.

At step 434 if all the ratios of ratios are less than a predetermined upper ratio cutoff, the unknown base is assigned
an ambiguity code that indicates the unknown base may be any one of the bases that has a complementary ratio of
ratios greater than the lower ratio cutoff and less than the upper ratio cutoff as shown at step 436. Thus, if the ratios of
ratios for A:A, C:C and G:G are all greater than the lower ratio cutoff and less than the upper ratio cutoff, the unknown
base would be assigned the code B (meaning "not A"). This is because the ratios of ratios are complementary to their
respective bases as follows:

A:A -> T
C:C -> G
G:G -> C

so the unknown base would be called as being either C, G, or T, which is identified by the IUPAC code B. This ambiguity
code would be placed in column 17 and a confidence code of 3 would be placed in column 18 of the analysis table.

At step 438 at least one of the ratios of ratios is greater than the upper ratio cutoff and the unknown base is called
as the base complementary to the highest ratio of ratios. The code for the base complementary to the highest ratio of
ratios would be placed in column 17 and a confidence code of 1 would be placed in column 18 of the analysis table.

Assume for the purposes of this example that the lower ratio cutoff is 1.5 and the upper ratio cutoff is 3. Again, the
ratios of ratios are as follows:

A:A -> 1.1
C:C -> 4.3
G:G -> 1.0
T:T -> 0.4

As not all of the ratios of ratios are less than the upper ratio cutoff, the unknown base is called as the base complementary
to the highest ratio of ratios. The highest ratio of ratios is C:C, which has a complementary base G. Thus, the unknown
base is called G, which is placed in column 17, and a confidence code of 1 is placed in column 18 of the analysis table.

The example shows how the unknown base in the sample nucleic acid sequence was correctly called as base G.
Although the complementary "mutation" probe associated with the base G (3'-GCCT) did not have the highest
fluorescence intensity, the unknown base was called as base G because the associated "mutation" probe had the highest
ratio increase over the other "mutation" probes.
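
Steps 402 through 438 of this first implementation can be sketched as follows. The sketch is illustrative only: the function name `reference_call`, its parameters, and the `floor` value are hypothetical, and treating a ratio of ratios equal to the lower cutoff as a candidate at step 436 is an assumption made so the candidate set is never empty. Because the sketch does not round intermediate values, its C:C ratio of ratios for the worked example is about 4.8 (58/28 is about 2.07); the call is unaffected, since this still exceeds the upper ratio cutoff of 3.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Ambiguity codes for sets of one to three candidate bases (keys sorted).
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CT": "Y", "CG": "S", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B",
}


def reference_call(ref, sample, ref_blank, sample_blank, ref_wild_type,
                   background_cutoff=5.0, lower=1.5, upper=3.0, floor=1e-6):
    """Return (base code, confidence code) for one unknown base."""
    def adjust(raw, blank):
        # Step 402: background subtraction with a small positive floor.
        return {b: (v - blank if v - blank > 0 else floor)
                for b, v in raw.items()}

    ref_adj = adjust(ref, ref_blank)
    smp_adj = adjust(sample, sample_blank)

    # Step 404: the probe for the reference wild-type carries its complement.
    ref_top = ref_adj[COMPLEMENT[ref_wild_type]]
    # Step 410: highest base intensity in the sample.
    smp_top = max(smp_adj.values())

    ref_ok = ref_top > background_cutoff
    smp_ok = smp_top > background_cutoff
    if not ref_ok and not smp_ok:
        return "N", 9          # steps 416-418
    if not ref_ok:
        return "N", 4          # steps 420-422
    if not smp_ok:
        return "N", 2          # steps 424-426

    # Steps 408, 414, 428: ratio of the reference ratio to the sample ratio.
    ratios = {b: (ref_top / ref_adj[b]) / (smp_top / smp_adj[b])
              for b in "ACGT"}

    if all(r < lower for r in ratios.values()):
        return ref_wild_type, 0                     # steps 430-432
    if all(r < upper for r in ratios.values()):
        # Steps 434-436 (assumption: '>=' keeps the set non-empty).
        bases = sorted(COMPLEMENT[b] for b, r in ratios.items() if r >= lower)
        return AMBIGUITY.get("".join(bases), "X"), 3
    best = max(ratios, key=ratios.get)              # step 438
    return COMPLEMENT[best], 1
```

With the example intensities (reference A 12, C 9, G 80, T 15 with blank 1; sample A 11, C 30, G 60, T 6 with blank 2; reference wild-type C), the sketch returns the call G with confidence code 1.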

Fig. 11A illustrates the high level flow of another implementation of the reference method. As in the previous
implementation, this implementation also compares the probe intensities of a reference sequence to the probe intensities of
a sample sequence. However, this implementation differs conceptually from the previous implementation in that
neighboring probe intensities are also analyzed, resulting in more accurate base calling.

As a simple example of this implementation of the reference method, suppose a reference sequence has a sequence
of 5'-AAACCCAATCCACATCA-3' (SEQ ID NO:6) and a sample sequence has a sequence of
5'-AAACCCAGTCCACATCA-3' (SEQ ID NO:7), where the mutant base is underlined. Thus, there is a mutation of A to G.
Suppose further that the reference and sample sequences are tiled on chips with the reference sequence being the chip
wild-type. This implementation of the reference method will be described as identifying this mutation base.

For illustration purposes, this implementation of the reference method is described as filling in a data table shown
in Fig. 11B (SEQ ID NO:6, SEQ ID NO:28, SEQ ID NO:29). Although the data table contains more data than is required
for this implementation, the portions of the data table that are produced by steps in Fig. 11A are shown with the same
reference numerals. The generation of a data table is not necessary, however, and is shown to aid the reader in
understanding the method. The mutant base position is at position 241 in the reference and sample sequences, which is
shown in bold in the data table.

At step 502 the base intensities of the reference and sample sequences are adjusted by subtracting the background
or "blank" cell intensity from each base intensity. Preferably, if a base intensity is then less than or equal to zero, the
base intensity is set equal to a small positive number to prevent division by zero or negative numbers. In the data table,
data 502A is the background subtracted base intensities for the reference sequence and data 502B is the background
subtracted base intensities for the sample sequence (also called the "mutant" sequence in the data table).

At step 504 the base intensity associated with the reference wild-type is checked to see if it has sufficient intensity
to call the unknown base. In this example, the reference wild-type is base A at position 241. The base intensity associated
with the reference wild-type is identified by a lower case "a" in the left hand column. Thus, the base intensities in the
data table are not identified by their complements and the reference wild-type at the mutation position has an intensity
of 385. The reference wild-type intensity of 385 is checked by determining if its intensity is greater than a predetermined
background difference cutoff. The background difference cutoff is a number that specifies the intensity the base intensities
must be over the background intensity in order to correctly call the unknown base. Thus, the base intensity associated
with the reference wild-type must be greater than the background difference cutoff or the unknown base is not callable.
If the base intensity associated with the reference wild-type is not greater than the background difference cutoff, the
wild-type sequence would fail to have sufficient intensity as shown at step 506. Otherwise, at step 508 the wild-type
sequence would pass by having sufficient intensity.

At step 510 calculations are performed on the background subtracted base intensities of the reference sequence
in order to "normalize" the intensities. Each position in the reference sequence has four background subtracted base
intensities associated with it. The ratio of the intensity of each base to the sum of the intensities of the possible bases
(all four) is calculated, resulting in four ratios, one for each base as shown in the data table. Thus, the following ratios
would be calculated at each position in the reference sequence:

A ratio = A / (A + C + G + T)
C ratio = C / (A + C + G + T)
G ratio = G / (A + C + G + T)
T ratio = T / (A + C + G + T)

At position 241, A ratio would be the wild-type ratio. These ratios are generally calculated in order to "normalize" the
intensity data as the photon counts may vary widely from experiment to experiment. Thus, the ratios provide a way of
reconciling the intensity variations across experiments. Preferably, if the photon counts do not vary widely from
experiment to experiment, the probe intensities do not need to be "normalized."
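
The normalization of step 510 (and, identically, step 518 for the sample) can be sketched as a short function. The name `normalize` is hypothetical, and apart from the reference wild-type intensity of 385 the intensities in the usage note are invented for illustration; they are not values from the data table.

```python
def normalize(intensities):
    """Divide each background-subtracted base intensity by the sum of all four.

    Assumes the intensities have already been clamped to positive values,
    so the sum cannot be zero.
    """
    total = sum(intensities.values())
    return {base: value / total for base, value in intensities.items()}
```

For instance, with hypothetical intensities of A 385, C 40, G 50, and T 25, the A ratio is 385 / 500 = 0.77 and the four ratios sum to 1.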

At step 512 the highest base intensity associated with the sample sequence is checked to see if it has sufficient
intensity to call the unknown base. The intensity is checked by determining if the highest intensity sample base is greater
than the background difference cutoff. If the intensity is not greater than the background difference cutoff, the sample
sequence fails to have sufficient intensity as shown at step 514. Otherwise, at step 516 the sample sequence passes
by having sufficient intensity.

At step 518 calculations are performed on the background subtracted base intensities of the sample sequence in
order to "normalize" the intensities. Each position in the sample sequence has four background subtracted base
intensities associated with it. The ratios of the intensity of each base to the sum of the intensities of the possible bases (all
four) are calculated, resulting in four ratios, one for each base as shown in the data table.

At step 520 if either the reference or sample sequences failed to have sufficient intensity, the unknown base is
assigned the code N (insufficient intensity) as shown at step 522.

At step 524 the normalized base intensities of the reference sequence are subtracted from the normalized base
intensities of the sample sequence. Thus, at each position the following calculations are performed:

A Difference = Sample A Ratio - Reference A Ratio
C Difference = Sample C Ratio - Reference C Ratio
G Difference = Sample G Ratio - Reference G Ratio
T Difference = Sample T Ratio - Reference T Ratio

where the reference and sample ratios are calculated at steps 510 and 518, respectively. The base differences resulting
from these calculations are shown in the data table.

At step 526 each position is checked to see if there is a base difference greater than an upper difference cutoff and
a base difference lower than a lower difference cutoff. For example, Fig. 11C shows a graph of the normalized sample
base intensities minus the normalized reference base intensities. Suppose that the upper difference cutoff is 0.15 and
the lower difference cutoff is -0.15 as shown by the horizontal lines in Fig. 11C. At the mutation position (labeled with a
reference 0), the G difference is 0.28, which is greater than 0.15, the upper difference cutoff. Similarly, the A difference
is -0.32, which is less than -0.15, the lower difference cutoff. As there is a base difference above the upper difference
cutoff and a base difference below the lower difference cutoff, there may be a mutation at this position.
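
The difference calculation of step 524 and the cutoff test of step 526 can be sketched together. The function name `flag_position` is hypothetical, and the normalized ratios in the usage note are invented so that they reproduce the G difference of 0.28 and the A difference of -0.32 quoted above; they are not taken from the data table.

```python
def flag_position(sample_ratios, reference_ratios, upper=0.15, lower=-0.15):
    """Return (differences, bases above upper, bases below lower, flagged).

    A position is flagged as a possible mutation only when at least one
    difference exceeds the upper cutoff AND at least one falls below the
    lower cutoff.
    """
    # Step 524: sample ratio minus reference ratio for each base.
    diffs = {b: sample_ratios[b] - reference_ratios[b] for b in "ACGT"}
    # Step 526: compare each difference against the two cutoffs.
    above = [b for b in "ACGT" if diffs[b] > upper]
    below = [b for b in "ACGT" if diffs[b] < lower]
    return diffs, above, below, bool(above) and bool(below)
```

With hypothetical reference ratios of A 0.70, C 0.10, G 0.10, T 0.10 and sample ratios of A 0.38, C 0.12, G 0.38, T 0.12, base G lies above the upper cutoff, base A lies below the lower cutoff, and the position is flagged.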

If there is neither a base difference above the upper difference cutoff nor a base difference below the lower difference
cutoff, the base at that position is assigned the code of the reference wild-type base as shown at step 528.

At step 530 the ratio of the highest background subtracted base intensity in the sample to the background subtracted
reference wild-type base intensity is calculated. For example, at the mutation position 241 in the data table, the highest
background subtracted base intensity in the sample is 571 (base G). The background subtracted reference wild-type
base intensity is 385 (base A). The ratio of 571:385 is calculated and results in 1.48 as shown in the data table.

At step 532 these ratios are compared to a ratio at a neighboring position. The ratio for the nth position is subtracted
from the ratio for the rth position, where r = n + 1. For example, at the mutation position 241 in the data table, the ratio
at position 242 (which equals 1.02) is subtracted from the ratio at position 241 (which equals 1.48). It has been found
that a mutant can be confidently detected by analyzing the difference of these neighboring ratios.
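
The arithmetic of steps 530 and 532 can be sketched as below. This is only an illustration: the function names are
assumptions, the sample intensities other than G = 571 are hypothetical, and the subtraction order follows the worked
example (the ratio at position 242 is subtracted from the ratio at position 241).

```python
def wild_type_ratio(sample_intensities, reference_wild_type_intensity):
    """Step 530: highest sample base intensity over the reference wild-type intensity."""
    return max(sample_intensities.values()) / reference_wild_type_intensity

def neighbor_differences(ratios):
    """Step 532: ratio at each position minus the ratio at the following position."""
    return [ratios[i] - ratios[i + 1] for i in range(len(ratios) - 1)]

# Position 241: G = 571 and the reference wild-type intensity 385 come from the text;
# the A, C, and T values are hypothetical.
sample_241 = {"A": 180.0, "C": 40.0, "G": 571.0, "T": 55.0}
ratio_241 = round(wild_type_ratio(sample_241, 385.0), 2)
ratio_242 = 1.02  # value quoted in the text for position 242

difference = neighbor_differences([ratio_241, ratio_242])[0]
print(ratio_241, round(difference, 2))
```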

Fig. 11D shows other graphs of data in the data table. Of particular importance is the graph identified as 532 because
this is a graph of the calculations at step 532. The pattern shown in a box in graph 532 has been found to be characteristic
of a mutation. Thus, if this pattern is detected, the base is called as the base (or bases) with a normalized difference
greater than the upper difference cutoff as shown at step 536. For example, the pattern was detected and at step 526
it was shown that base G had a normalized difference of 0.28, which is greater than the upper difference cutoff of 0.15.
Therefore, the base at position 241 in the sample sequence is called a base G, which is a mutation from the reference
sequence (A to G).

If the pattern is not detected at step 534, the base at that position is assigned the code of the reference wild-type
base as shown at step 538.

This second implementation of the reference method is preferable in some instances as it takes into account probe
intensities of neighboring probes. Thus, the first implementation may not have detected the A to G mutation in this
example.

The advantage of the reference method is that the correct base can be called even in the presence of significant
levels of cross-hybridization, as long as ratios of intensities are fairly consistent from experiment to experiment. In
practice, the number of miscalls and ambiguities is significantly reduced, while the number of correct calls is actually
increased, making the reference method very useful for identifying candidate mutations. The reference method has also
been used to compare the reproducibility of experiments in terms of base calling.

IV. Statistical Method 

The statistical method is a method of calling bases in a sample nucleic acid sequence. The statistical method utilizes
the statistical variation across experiments to call the bases. Therefore, the statistical method is preferable when data
from multiple experiments are available and the data are fairly consistent across the experiments. The method compares
the probe intensities of a sample sequence to statistics of probe intensities of a reference sequence in multiple
experiments.

For simplicity, the statistical method will be described as being used to identify one unknown base in a sample
nucleic acid sequence. In practice, the method is used to identify many or all the bases in a nucleic acid sequence.

The unknown base will be called by comparing the probe intensities of a sample sequence to statistics on probe
intensities of a reference sequence in multiple experiments. Generally, the probe intensities of the sample sequence
and the reference sequence experiments are from chips having the same chip wild-type. However, the reference
sequence may or may not be equal to the chip wild-type, as it may have mutations.

A base at the same position in the reference and sample sequences will be associated with up to four mutation
probes and a "blank" cell. As before, because each mutation probe is identifiable by the mutation base, the mutation
probes' intensities will be referred to as the "base intensities" of their respective mutation bases.

As a simple example of the statistical method, suppose a gene of interest (target) has the sequence
5'-AAAACTGAAAA-3' (SEQ ID NO:4). Suppose a reference sequence has the sequence 5'-AAAACCGAAAA-3' (SEQ ID NO:5),
which differs from the target sequence by the underlined base. Suppose further that a sample sequence is suspected
to have the same sequence as the target sequence except for a T base mutation at the underlined base position in
5'-AAAACTGAAAA-3' (SEQ ID NO:4). Suppose that in multiple experiments the reference sequence is marked and
exposed to probes on a chip. Suppose further the sample sequence is also marked and exposed to probes on a chip.

The following are complementary "mutation" probes that could be used for a reference experiment and the sample
sequence:

Reference      Sample
3'-TGAC        3'-GACT
3'-TGCC        3'-GCCT
3'-TGGC        3'-GGCT
3'-TGTC        3'-GTCT

The "mutation" probes shown for the reference sequence may be from only one experiment; the other experiments may
have different "mutation" probes, chip wild-types, tiling methods, and the like. Although each fluorescence intensity is
from a probe, since the probes may be identified by their unique mutation bases, the probe intensities may be identified
by their respective bases as follows:

Reference          Sample
3'-TGAC -> A       3'-GACT -> A
3'-TGCC -> C       3'-GCCT -> C
3'-TGGC -> G       3'-GGCT -> G
3'-TGTC -> T       3'-GTCT -> T



Thus, base A of the reference sequence will be described as having an intensity which corresponds to the intensity of
the mutation probe with the mutation base A. The statistical method will now be described as calling the unknown base
in the sample sequence by using this example.

Fig. 12 illustrates the high level flow of the statistical method. At step 602 the four base intensities associated with
the sample sequence and each of the multiple reference experiments are adjusted by subtracting the background or
"blank" cell intensity from each base intensity. Preferably, if a base intensity is then less than or equal to zero, the base
intensity is set equal to a small positive number to prevent division by zero or negative numbers.

At step 604 the intensities of the reference wild-type bases in the multiple experiments are checked to see if they
all have sufficient intensity to call the unknown base. The intensities are checked by determining if the intensity of the
reference wild-type base of an experiment is greater than a predetermined background difference cutoff. The wild-type
probe shown earlier for the reference sequence is 3'-TGGC, and thus the G base intensity is the wild-type base intensity.
These steps are analogous to steps in the other two methods described herein.

If the intensity of any one of the reference wild-type bases is not greater than the background difference cutoff, the
wild-type experiments fail to have sufficient intensity as shown at step 606. Otherwise, at step 608 the wild-type
experiments pass by having sufficient intensity.

At step 610 calculations are performed on the background subtracted base intensities of each of the reference
experiments in order to "normalize" the intensities. Each reference experiment has four background subtracted base
intensities associated with it: one wild-type and three for the other possible bases. In this example, the G base intensity
is the wild-type, the A, C, and T base intensities being the "other" intensities. The ratios of the intensity of each base to
the sum of the intensities of the possible bases (all four) are calculated, giving one wild-type ratio and three "other" ratios.
Thus, the following ratios would be calculated:

A ratio = A / (A + C + G + T)
C ratio = C / (A + C + G + T)
G ratio = G / (A + C + G + T)
T ratio = T / (A + C + G + T)

where G ratio is the wild-type ratio and A, C, and T ratios are the "other" ratios. These four ratios are calculated for each
reference experiment. Thus if the number of reference experiments is n, there would be 4n ratios calculated. These
ratios are generally calculated in order to "normalize" the intensity data, as the photon counts may vary widely from
experiment to experiment. However, if the probe intensities do not vary widely from experiment to experiment, the probe
intensities do not need to be "normalized."
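
A minimal sketch of the background subtraction of step 602 and the normalization of step 610 is given below. The
function names, the value of the "small positive number," and the raw intensities are assumptions made for
illustration; the text does not specify them.

```python
BASES = "ACGT"
EPSILON = 1e-6  # assumed value for the "small positive number"; not given in the text

def subtract_background(raw, blank):
    """Step 602: subtract the blank cell intensity; replace results <= 0 with EPSILON."""
    adjusted = {}
    for b in BASES:
        value = raw[b] - blank
        adjusted[b] = value if value > 0 else EPSILON
    return adjusted

def normalize(intensities):
    """Step 610: ratio of each base intensity to the sum of all four."""
    total = sum(intensities[b] for b in BASES)
    return {b: intensities[b] / total for b in BASES}

# Hypothetical raw photon counts for one reference experiment (G is the wild-type base).
raw = {"A": 200.0, "C": 70.0, "G": 750.0, "T": 150.0}
ratios = normalize(subtract_background(raw, blank=40.0))
print({b: round(r, 3) for b, r in ratios.items()})
```

Clamping to a small positive value keeps every ratio defined and non-negative even when a probe is dimmer than the
blank cell.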

At step 612 statistics are prepared for the ratios calculated for each of the reference experiments. As stated before,
each reference experiment will be associated with one wild-type ratio and three "other" ratios. The mean and standard
deviation are calculated for all the wild-type ratios. The mean and standard deviation are also calculated for each of the
other ratios, resulting in three other means and standard deviations, one for each of the bases that is not the wild-type base.
Therefore, the following would be calculated:

Mean and standard deviation of A ratios
Mean and standard deviation of C ratios






Mean and standard deviation of G ratios
Mean and standard deviation of T ratios

where the mean and standard deviation of the G ratios are also known as the wild-type mean and the wild-type standard
deviation, respectively. The means and standard deviations of the A, C, and T ratios are also
known collectively as the "other" means and standard deviations.

Suppose that the preceding calculations produced the following data: 

A ratios -> mean = 0.16   std. dev. = 0.003
C ratios -> mean = 0.03   std. dev. = 0.002
G ratios -> mean = 0.71   std. dev. = 0.050
T ratios -> mean = 0.11   std. dev. = 0.004
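
For illustration only, the per-base statistics of step 612 could be computed as sketched below. The three sets of
ratios are hypothetical and are not the data behind the values above, and the use of the sample standard deviation
(rather than the population form) is an assumption, since the text does not say which is intended.

```python
from statistics import mean, stdev

BASES = "ACGT"

def ratio_statistics(experiments):
    """Step 612: mean and standard deviation of each base ratio across reference experiments."""
    stats = {}
    for b in BASES:
        values = [ratios[b] for ratios in experiments]
        stats[b] = (mean(values), stdev(values))  # sample standard deviation (assumed)
    return stats

# Hypothetical normalized ratios from three reference experiments.
experiments = [
    {"A": 0.157, "C": 0.028, "G": 0.705, "T": 0.110},
    {"A": 0.160, "C": 0.030, "G": 0.700, "T": 0.110},
    {"A": 0.163, "C": 0.032, "G": 0.695, "T": 0.110},
]

stats = ratio_statistics(experiments)
for base, (m, s) in stats.items():
    print(base, round(m, 3), round(s, 4))
```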



In one embodiment, the steps up to and including step 612 are performed in a preprocessing stage for the multiple
wild-type experiments. The results of the preprocessing stage are stored in a file so that the reference calculations do
not have to be repeatedly calculated, improving performance.

At step 614 the highest base intensity associated with the sample sequence is checked to see if it has sufficient
intensity to call the unknown base. The intensity is checked by determining if the highest intensity unknown base is
greater than the background difference cutoff. If the intensity is not greater than the background difference cutoff, the
sample sequence fails to have sufficient intensity as shown at step 616. Otherwise, at step 618 the sample sequence
passes by having sufficient intensity.

At step 620 calculations are performed on the four background subtracted intensities of the sample sequence. The
ratios of the background subtracted intensity of each base to the sum of the background subtracted intensities of the
possible bases (all four) are calculated, giving four ratios, one for each base. For consistency, the ratio associated with
the reference wild-type base is called the wild-type ratio, with there being three "other" ratios. Thus, the following ratios
are calculated:

A ratio = A / (A + C + G + T)
C ratio = C / (A + C + G + T)
G ratio = G / (A + C + G + T)
T ratio = T / (A + C + G + T)

where ratio G is the wild-type ratio and ratios A, C, and T are the "other" ratios.

Suppose the background subtracted intensities associated with the sample are as follows:

A -> 310
C -> 50
G -> 26
T -> 100

Then, the corresponding ratios would be as follows:

A ratio = 310 / (310 + 50 + 26 + 100) = 0.64
C ratio = 50 / (310 + 50 + 26 + 100) = 0.10
G ratio = 26 / (310 + 50 + 26 + 100) = 0.05
T ratio = 100 / (310 + 50 + 26 + 100) = 0.21
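
The arithmetic above can be checked with a few lines of code. The intensities are the ones given in the text; the
variable names are merely illustrative.

```python
# Background subtracted sample intensities from the example above.
sample = {"A": 310, "C": 50, "G": 26, "T": 100}

total = sum(sample.values())  # 310 + 50 + 26 + 100
ratios = {base: round(value / total, 2) for base, value in sample.items()}
print(total, ratios)
```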



At step 622 if either the reference experiments or the sample sequence failed to have sufficient intensity, the
unknown base is assigned the code N (insufficient intensity) as shown at step 624.

At step 626 the wild-type and "other" ratios associated with the sample sequence are compared to statistical
expressions. The statistical expressions include four predetermined standard deviation cutoffs, one associated with each base.
Thus, there is a standard deviation cutoff for each of the bases A, C, G, and T. The localized standard deviation cutoffs
allow the unknown base to be called with higher precision because each standard deviation cutoff can be set to a different
value. Suppose the standard deviation cutoffs are set as follows:

A standard deviation cutoff -> 4 

C standard deviation cutoff -> 2 

G standard deviation cutoff -> 8 

T standard deviation cutoff -> 4 

The wild-type base ratio associated with the sample is compared to a corresponding statistical expression:

WT ratio ≥ WT mean - (WT std. dev. * WT base std. dev. cutoff)

where the WT base std. dev. cutoff is the standard deviation cutoff for the wild-type base. As the wild-type base is G,
the above comparison solves to the following:

0.05 ≥ 0.71 - (0.050 * 8)
0.05 ≥ 0.31

which is not a true expression (0.05 is not greater than 0.31).

Each of the "other" ratios associated with the sample is compared to a corresponding statistical expression:

Other ratio > Other mean + (Other std. dev. * Other base std. dev. cutoff)

where the Other base std. dev. cutoff is the standard deviation cutoff for the particular "other" base. Thus, the above
comparison solves to the following three expressions:

A -> 0.64 > 0.16 + (0.003 * 4)
     0.64 > 0.17
C -> 0.10 > 0.03 + (0.002 * 2)
     0.10 > 0.03
T -> 0.21 > 0.11 + (0.004 * 4)
     0.21 > 0.13

which are all true expressions.
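
The comparisons of step 626 can be sketched as follows, using the means, standard deviations, cutoffs, and sample
ratios from the example. The names and data layout are illustrative assumptions, and the wild-type comparison is
taken to be "greater than or equal to."

```python
# (mean, standard deviation) of the reference ratios, from the example above.
STATS = {"A": (0.16, 0.003), "C": (0.03, 0.002), "G": (0.71, 0.050), "T": (0.11, 0.004)}
CUTOFFS = {"A": 4, "C": 2, "G": 8, "T": 4}   # standard deviation cutoffs per base
SAMPLE = {"A": 0.64, "C": 0.10, "G": 0.05, "T": 0.21}
WILD_TYPE = "G"

def passes(base, ratio):
    """Compare one sample ratio against its statistical expression."""
    mean, std = STATS[base]
    if base == WILD_TYPE:
        # WT ratio >= WT mean - (WT std. dev. * cutoff)
        return ratio >= mean - std * CUTOFFS[base]
    # Other ratio > Other mean + (Other std. dev. * cutoff)
    return ratio > mean + std * CUTOFFS[base]

results = {base: passes(base, SAMPLE[base]) for base in "ACGT"}
print(results)
```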

At step 628 if only the wild-type ratio of the sample sequence was greater than the statistical expression, the unknown
base is assigned the code of the reference wild-type base as shown at step 630.

At step 632 if one or more of the "other" ratios of the sample sequence were greater than their respective statistical
expressions, the unknown base is assigned an ambiguity code that indicates the unknown base may be any one of the
complements of these bases, including the reference wild-type. In this example, the "other" ratios for A, C, and T were
all greater than their corresponding statistical expressions. Thus, the unknown base would be called the complements
of these bases, represented by the subset T, G, and A. Thus, the unknown base would be assigned the code D (meaning
"not C").

If none of the ratios are greater than their respective statistical expressions, the unknown base is assigned the code
X (insufficient discrimination) as shown at step 636.
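
One possible reading of steps 628 through 636 is sketched below. It assumes that every probe base whose ratio
passed its statistical expression contributes its complement to the call, that the wild-type complement is included
only when the wild-type ratio itself passed (as in the worked example), and that the standard IUPAC ambiguity codes
are used. The function and table names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    # IUPAC "any base"; the text also uses N for insufficient intensity and does not
    # say how the two cases are distinguished.
    frozenset("ACGT"): "N",
}

def call_base(passing_probe_bases):
    """Map the probe bases that passed step 626 to a base call or ambiguity code."""
    if not passing_probe_bases:
        return "X"  # insufficient discrimination (step 636)
    called = frozenset(COMPLEMENT[b] for b in passing_probe_bases)
    return IUPAC[called]

print(call_base({"A", "C", "T"}))  # probe bases that passed in the worked example
print(call_base({"G"}))            # only the wild-type probe base passed
```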

The statistical method provides accurate base calling because it utilizes statistical data from multiple reference
experiments to call the unknown base. The statistical method has also been used to implement confidence estimates
and calling of mixed sequences.

V. Pooling Processing

The present invention provides pooling processing, which is a method of processing reference and sample nucleic
acid sequences together to reduce variations across individual experiments. In the representative embodiment discussed
herein, the reference and sample nucleic acid sequences are labeled with different fluorescent markers emitting light at
different wavelengths. However, the nucleic acids may be labeled with other types of markers including distinguishable
radioactive markers.

After the reference and sample nucleic acid sequences are labeled with different color fluorescent markers, the
labeled reference and sample nucleic acid sequences are then combined and processed together. An apparatus for
detecting targets labeled with different markers is provided in U.S. Application No. 08/195,889, which is hereby incorporated
by reference for all purposes.

Fig. 13 illustrates the pooling processing of a reference and sample nucleic acid sequence. At step 702 a reference
nucleic acid sequence is marked with a fluorescent dye, such as fluorescein. At step 704 a sample nucleic acid sequence
is marked with a dye that, upon excitation, emits light of a different wavelength than that of the fluorescent dye of the
reference sequence. For example, the sample nucleic acid sequence may be marked with rhodamine. Alternatively, the
sample nucleic acid sequence may be marked by attaching biotin to the sample sequence which will subsequently bind
to streptavidin labeled with phycoerythrin. Of course, either sequence may be marked with these or other dyes or other
kinds of markers (e.g., radioactive) as long as the other sequence is marked with a marker that is distinguishable.

At step 706 the labeled reference sequence and the labeled sample sequence are combined. After this step,
processing continues in the same manner as for only one labeled sequence. At step 708 the sequences are fragmented. The
fragmented nucleic acid sequences are then hybridized on a chip containing probes as shown at step 710.

At step 712 a scanner generates image files that indicate the locations where the labeled nucleic acids bound to
the chip. There is typically some overlap between the two signals. This is corrected for prior to further analysis, i.e., after
correction, the data files correspond to "reference" and "sample." In general, the scanner generates an image file by
focusing excitation light on the hybridized chip and detecting the fluorescent light that is emitted. The marker emitting
the fluorescent light can be identified by the wavelength of the light. For example, the fluorescence peak of fluorescein
is about 530 nm while that of a typical rhodamine dye is about 580 nm.

The scanner creates an image file for the data associated with each fluorescent marker, indicating the locations
where the correspondingly labeled nucleic acid bound to the chip. Based upon an analysis of the fluorescence intensities
and locations, it becomes possible to extract information such as the monomer sequence of DNA or RNA.

Pooling processing reduces variations across individual experiments because much of the test environment is
common. Although pooling processing has been described as being used to improve the combined processing of reference
and sample nucleic acid sequences, the process may also be used for two reference sequences, two sample sequences,
or multiple sequences by utilizing multiple distinguishable markers.

Pooling processing may also be utilized with methods of the present invention of identifying mutations in a sample
nucleic acid sequence. These methods are highly accurate in identifying single mutations, locating multiple mutations
and removing false positives for mutations, where a false positive is a base that has erroneously been identified as a
mutation. These methods utilize hybridization data from more than one base position to identify the likely position of
mutations. The interrogation position on the probes is utilized to more accurately identify likely mutations, which makes
more efficient use of base calling methods. These methods may be advantageously combined with the base calling
methods described herein to efficiently and accurately sequence a sample nucleic acid sequence.

As discussed earlier in reference to Fig. 8, the fluorescent intensities of cells near an interrogation position having
a mutation are relatively dark, which creates "dark regions" around the mutation. These lower fluorescent intensities
result because the cells at interrogation positions near a mutation do not contain probes that are perfectly complementary
to the sample sequence. Thus, the hybridization of these probes with the sample sequence is lower. The characteristics
of these "dark regions" may be utilized to identify mutations and false positives.

For example, a sample sequence and a reference sequence were labeled with different fluorescent markers, in this
case fluorescein and biotin/phycoerythrin. The sample and reference sequences are known and the sample sequence
is identical to the reference sequence except for mutations at certain known positions. The sample and reference
sequences were then processed together using the pooling processing described above and the sequences were
hybridized to a chip including wild-type probes that are perfectly complementary to the reference sequence. The chip included
20-mer probes with the interrogation position of each probe being at the 12th base position in the probe.

Fig. 14A shows a graph of the scaled fluorescent intensities (photon counts) of the wild-type probes hybridizing with
the sample and reference sequences. Along the bottom of the graph are numbers which represent wild-type cell positions
on the chip. The photon counts of the probes in the wild-type cells are plotted on a logarithmic (base 10) scale. As shown,
the photon counts range from 1 (representing a de minimis value) to 100,000. The photon count for the probes in
the wild-type cell numbered "45" is around 10,000.

At various wild-type cells, the photon count for the probes in the cells drops to 1 or lower. For example, the photon
counts for wild-type cells numbered 11, 24, 39, etc. are 1. The low photon counts are due to the fact that there are no
probes in these cells. The cells are left "blank" in order to minimize diffraction edges and thus, the location of these blank
cells is known. Consequently, the intermittent wild-type cells that have a photon count of 1 do not represent erroneous
data.

As shown in Fig. 14A, the scaled photon counts for the wild-type probes hybridizing with the sample and reference
sequences are almost the same except for two "bubbles." A bubble 730 has a top curve defined by the photon counts
of the wild-type probes that hybridized with the reference sequence and a bottom curve defined by the photon counts
of the wild-type probes that hybridized with the sample sequence. Following bubble 730, there is a section 732 where
the photon counts for the wild-type probes hybridizing with the sample and reference sequences are almost the same.
After section 732 is another bubble 734 which again has a top curve defined by the hybridization of the reference
sequence and the bottom curve defined by the hybridization of the sample sequence. Another partial bubble is shown
to the right of bubble 734.

Each bubble in Fig. 14A corresponds to a dark region surrounding a single mutation. Because the wild-type probes
at and surrounding a mutant position in the sample sequence contain a single base mismatch with the sample sequence,
the hybridization is relatively lower, which results in lower photon counts. Much information about the sample sequence
may be acquired by a detailed analysis of these bubble regions.

The width of the bubble indicates whether there is a false positive, a single mutation or a multiple mutation. If there
is a single mutation, the width of the bubble should be approximately equal to the probe length. For example, Fig. 14A
was produced utilizing 20-mer probes. Accordingly, bubbles 730 and 734 are approximately 20 wild-type cells wide,
indicating that both of these bubbles were produced by single mutations. The width of the dark region resulting from a
single mutation is believed to be approximately equal to the probe length because each of the probes in this region has
a single base mismatch with the sample sequence.

If the width of the bubble is substantially less than the probe length, the bubble may represent a false positive. For
example, assume that at wild-type cell number 45 in Fig. 14A, the hybridization of the wild-type probe with the sample
sequence was very low (e.g., around 1000 photon counts). A base calling algorithm that calls the bases according to
the intensities among the cells at that position may indicate that there is a mutation at this position. However, the low
photon counts may be due to dust on the chip and not due to lower hybridization. Since the width of this bubble would
be 1, which is substantially lower than the probe length of 20, the lower photon count at wild-type cell 45 would not be
due to a mutation (i.e., there is no dark region surrounding that position).

If the width of the bubble is substantially more than the probe length, the bubble may represent multiple mutations.
In other words, the bubble may be produced by more than one overlapping dark region. The analysis of such a bubble
will be discussed in more detail in reference to Fig. 14C.

Returning to Fig. 14A, each of bubbles 730 and 734 is approximately 20 bases wide, indicating with a high degree
of certainty that each of the bubbles represents a single mutation. Furthermore, the bubbles may be analyzed to determine
the probable location of the mutations within the bubbles. As mentioned earlier, the 20-mer probes on the chip had an
interrogation position at the 12th base position in the probe. Thus, the base at the 12th base position is the base that
varies among the related WT-, A-, G- and T-cells. Accordingly, the mutation should be located at the 12th position in
the bubble.

The actual mutation in bubble 730 occurs at the 12th position (from the left). Additionally, the actual mutation in
bubble 734 occurs at the 12th position (from the left). Thus, as the graph shows, there are 11 bases to the left of each
mutation and 8 bases to the right of each mutation. By utilizing the location of the interrogation position within the probes,
the present invention can help to identify the probable location of a mutation within a dark region or bubble.

Additionally, because this method identifies specific locations that may have a mutation, more efficient base calling
may be achieved. For example, an analysis of bubble 730 indicates that there is likely to be a single mutation around
wild-type cell 15. Typically, most errors in base calling occur in the dark regions surrounding a mutation. Many false
positives in this dark zone can now be eliminated because they are incompatible with the bubble size (which indicates
a single mutation, for example). Also, by identifying clearly a "mismatch zone," we can now apply algorithms that factor in
the effect of a mismatch or multiple mismatches.

Additionally, the shape of the bubble may indicate what mutation has occurred. Fig. 14B shows a hypothetical graph
of the fluorescent intensities vs. cell locations for wild-type probes hybridizing with two sample sequences and one
reference sequence. A C-A mismatch will be more destabilizing to probe hybridization than a U-G mismatch. As shown,
the more destabilizing C-A mismatch results in a larger volume bubble. The shape of the bubble may be utilized to identify
the particular mutation by pattern matching bubbles stored in a library.

Fig. 14C shows a graph of the fluorescent intensities (photon counts) of the wild-type probes hybridizing with the
sample and reference sequences. A single bubble 750 is flanked on either side by regions 752 and 754 which do not
contain a mutation. The graph was produced from a chip containing 20-mer probes with an interrogation position at base
12 on the probes.

As shown, bubble 750 is 27 bases wide, indicating that the bubble was produced from the dark regions surrounding
more than one mutation, as 27 is greater than 20, the length of the probes. In addition to providing information that
there are multiple mutations, analysis of the bubble indicates the probable position of two of the mutations. Because the
interrogation position is at base 12 in the 20-mer probes, one of the mutations should be around 12 bases from the left
end of the bubble while another mutation should be around 8 bases from the right end of the bubble. And in fact, there
is a mutation of T to C at wild-type cell 62 which is 12 bases from the left of the bubble. Additionally, there is a mutation
of A to G at wild-type cell 69 which is 8 bases from the right of the bubble.
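
The position estimate described above can be sketched as below. The function name is hypothetical, and the bubble
start cells (51 for bubble 750 and 4 for bubble 730) are not stated in the text; they are inferred here from the stated
mutation cells and widths, so they should be read as assumptions.

```python
def likely_mutation_cells(bubble_start, bubble_width, probe_length=20, interrogation=12):
    """Estimate the leftmost and rightmost mutation cells within a bubble."""
    bubble_end = bubble_start + bubble_width - 1
    # The leftmost mutation sits at the interrogation position counted from the left edge.
    left = bubble_start + interrogation - 1
    # The rightmost mutation has (probe_length - interrogation) cells to its right.
    right = bubble_end - (probe_length - interrogation)
    return left, right

# Bubble 750: 27 cells wide, assumed to span cells 51 through 77.
print(likely_mutation_cells(51, 27))
# A single-mutation bubble 20 cells wide, assumed to start at cell 4.
print(likely_mutation_cells(4, 20))
```

For a bubble exactly one probe length wide, the two estimates coincide, which corresponds to the single-mutation
case.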

The third and last mutation within bubble 750 may be identified by performing base calling methods within the bubble.
Alternatively, the mutation may be identified by pattern matching bubbles from a library that indicate not only the number
of mutations but also the specific location and type of mutation.

Fig. 15 illustrates the high level flow of one embodiment of the present invention that uses the hybridization data
from more than one base position to identify mutations in a sample nucleic acid sequence. After probe intensities from
the hybridization of wild-type probes with a sample and reference sequence are measured, the system identifies a bubble
region at step 780. Bubble regions are identified as regions where the hybridization of the wild-type probes to the sample
and reference sequence differs significantly. Additionally, the reference sequence should hybridize more strongly with the
wild-type probes since the wild-type probes will be perfectly complementary to the reference sequence.
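
The text does not say how a "significant" difference is measured at step 780. One hedged sketch, assuming that a
bubble is a run of cells where the reference photon count exceeds the sample photon count by at least a chosen number
of decades, and that known blank cells are skipped, is the following. The threshold, the function name, and the
photon counts are all hypothetical.

```python
import math

def find_bubbles(reference, sample, blank_cells=frozenset(), min_log_drop=0.5):
    """Return (start, end) cell index pairs where reference exceeds sample by min_log_drop decades."""
    bubbles, start, last = [], None, None
    for i, (ref, smp) in enumerate(zip(reference, sample)):
        if i in blank_cells:
            continue  # known blank cells carry no hybridization information
        darker = math.log10(ref) - math.log10(smp) >= min_log_drop
        if darker:
            if start is None:
                start = i
            last = i
        elif start is not None:
            bubbles.append((start, last))
            start = None
    if start is not None:
        bubbles.append((start, last))
    return bubbles

# Hypothetical photon counts: a 20-cell dark region at cells 5-24 and a blank cell at 11.
reference = [10000.0] * 30
sample = [10000.0] * 30
for cell in range(5, 25):
    sample[cell] = 1000.0
reference[11] = sample[11] = 1.0

print(find_bubbles(reference, sample, blank_cells={11}))
```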

At step 782, the system compares the base width of the bubble to the probe length. If the bubble width is substantially
less than the probe length, the bubble does not represent a mutation at step 784. The determination of how much less
the bubble width must be may vary according to experiment conditions.

At step 786, the system compares the base width of the bubble to the probe length to determine if they are
approximately equal. If the bubble width is approximately equal to the probe length, the bubble represents a single base mutation
at step 788. Again, the determination of how close the bubble width should be to the probe length may vary according
to experiment conditions.

If the bubble width is substantially more than the probe length, the bubble represents multiple mutations at step 790.
The system performs base calling at likely locations of mutations at step 792. The likely locations of mutations are
determined by both the width of the bubble and the location of the interrogation position on the probes. Additionally, the
system may analyze the pattern of the bubble to determine the specific mutations and their positions.
The base calling method used with the present invention may be the intensity ratio method, reference
method, statistical method, or any other method.
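
The width comparison of steps 782 through 790 might be sketched as follows. Because the text leaves "substantially
less," "approximately equal," and "substantially more" to the experiment conditions, the tolerance of two cells used
here is purely an assumption, as is the function name.

```python
def classify_bubble(width, probe_length=20, tolerance=2):
    """Classify a bubble by comparing its width in cells to the probe length."""
    if width < probe_length - tolerance:
        return "false positive"      # step 784: does not represent a mutation
    if width <= probe_length + tolerance:
        return "single mutation"     # step 788
    return "multiple mutations"      # step 790

for width in (1, 20, 27):
    print(width, classify_bubble(width))
```

The widths 1, 20, and 27 correspond to the dust example, bubbles 730 and 734, and bubble 750 discussed above.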

At step 794, the system produces confidences that the mutations are identified correctly. Each confidence is
determined by how closely the experimental data matched the data expected for the mutation that was called. For example,
if the bubble width was exactly the same as the probe length and the base calling method identified a mutation at the
interrogation position in the probes, there is a very high likelihood or probability that the mutation was identified correctly.
The confidence may also be produced according to how closely the bubble pattern matched the pattern for that mutation
or mutations in the library of patterns.

Although in a preferred embodiment, this method of identifying mutations in a sample nucleic acid sequence is
utilized in conjunction with pooling processing in order to reduce variations, the method may be utilized without pooling
processing. For example, the method may be utilized effectively where the variations between separate experiments are
minimized or the data is adjusted accordingly. Therefore, this method is not limited to the embodiment discussed above.

The present invention provides methods of accurately identifying single mutations, locating multiple mutations and
removing false positives for mutations. These methods are advantageously performed with pooling processing and utilize
hybridization data from more than one base position to identify the likely position of mutations. The interrogation position
on the probes is also utilized to more accurately identify the likely position of mutations, which makes more efficient use
of base calling methods.

VI. Comparative Analysis (ViewSeq™)

The present invention provides a method of comparative analysis and visualization of multiple experiments. The
method allows the intensity ratio, reference, and statistical methods to be run on multiple datafiles simultaneously. This
permits different experimental conditions, sample preparations, and analysis parameters to be compared in terms of
their effects on sequence calling. The method also provides verification and editing functions, which are essential to
reading sequences, as well as navigation and analysis tools.

Fig. 16 illustrates the main screen and the associated pull down menus for comparative analysis and visualization
of multiple experiments (SEQ ID NO:8 and SEQ ID NO:9). The windows shown are from an appropriately programmed
Sun Workstation. However, the comparative analysis software may also be implemented on or ported to a personal
computer, including IBM PCs and compatibles, or other workstation environments. A window 802 is shown having pull
down menus for the following functions: File 804, Edit 806, View 808, Highlight 810, and Help 812.

The main section of the window is divided into a reference sequence area 814 and a sample sequence area 816.
The reference sequence area is where known sequences are displayed and is divided into a reference name subarea
818 and reference base subarea 820. The reference name subarea is shown with the filenames that contain the reference
sequences. The chip wild-type is identified by the filename with the extension ".wt#" where the # indicates a unit on the
chip. The reference base subarea contains the bases of the reference sequences. A capital C 822 is displayed to the
right of the reference sequence that is the chip wild-type for the current analysis. Although the chip wild-type sequence
has associated fluorescence intensities, the other reference sequences shown below the chip wild-type may be known
sequences that have not been tiled on the chip. These may or may not have associated fluorescence intensities. The
reference sequences other than the chip wild-type are used for sequence comparisons and may be in the form of simple
ASCII text files.

Sample sequence area 816 is where sample or unknown experimental sequences are displayed for comparison
with the reference sequences. The sample sequence area is divided into a sample name subarea 824 and sample base
subarea 826. The sample name subarea is shown with filenames that contain the sample sequences. The filename
extensions indicate the method used to call the sample sequence where ".cq#" denotes the intensity ratio method,
".rq#" denotes the reference method, and ".sq#" denotes the statistical method (# indicates the unit on the chip). The sample
base subarea contains the bases of the sample sequences. The bases of the sample sequences are identified by the
codes previously set forth which, for the most part, conform to the IUPAC standard.

Window 802 also contains a message panel 828. When the user selects a base with an input device in the reference
or sample base subarea, the base becomes highlighted and the pathname of the file containing the base is displayed
in the message panel. The base's position in the nucleic acid sequence is also displayed in the message panel.

In pull down menu File 804, the user is able to load files of experimental sequences that have been tiled and scanned
on a chip. There is a chip wild-type associated with each experimental sequence. The chip wild-type associated with
the first experimental sequence loaded is read and shown as the chip wild-type in reference sequence area 814. The
user is also able to load files of known nucleic acid sequences as reference sequences for comparison purposes. As
before, these known reference sequences may or may not have associated probe intensity data. Additionally, in this
menu the user is able to save sequences that are selected on the screen into a project file that can be loaded in at a
later time. The project file also contains any linkage of the sequences, where sequences are linked for comparison
purposes. Sequences to be saved, both reference and sample, are chosen by selecting the sequence filename with an
input device in the reference or sample name subareas.

In pull down menu Edit 806, the user is able to link together sequences in the reference and sample sequence
areas. After the user has selected one reference and one or more sample sequences, the sample sequences can be
linked to the reference sequence by selecting an entry in the pull down menu. Once the sequences are linked, a link
number 830 is displayed next to each of the sequences of related interest. Each group of linked sequences is associated
with a unique link number, so the user can easily identify which sequences are linked together. Linking sequences
permits the user to more easily compare the linked sequences. The user is also able to remove and display links from
this menu.

In pull down menu View 808, the user is able to display intensity graphs for selected bases. Once a base is selected
in the reference or sample base subareas, the user may request an intensity graph showing the hybridized probe
intensities of the selected base and a delineated neighborhood of bases near the selected base. Intensity graphs may be
displayed for one or multiple selected bases. The user is also able to prepare comment files and reports in this menu.

Fig. 17 illustrates an intensity graph window for a selected base at position 120 (SEQ ID NO:30 and SEQ ID NO:31).
The filename containing the sequence data is displayed at 904. The graph shows the intensities for each of the hybridized
probes associated with a base. Each grouping of four vertical bars on the graph, which are labeled as "a", "c", "g", and
"t" on line 906, shows the background subtracted intensities of probes having the indicated substitution base. In one
embodiment, the called bases are shown in red. The wild-type base is shown at line 908, the called base is shown at
line 910, and the base position is shown at line 912. In Fig. 17, the base selected is at position 120, as shown by arrow
914. The wild-type base at this position is T; however, the called base is M, which means the base is either A or C (amino).
The user is able to use intensity graphs to visually compare the intensities of each of the possible calls.

Fig. 18 illustrates multiple intensity graph windows for selected bases (SEQ ID NO:32, SEQ ID NO:33, SEQ ID
NO:34, and SEQ ID NO:35). There are three intensity graph windows 1002, 1004, and 1006 as shown. Each window
may be associated with a different experiment where the sequence analyzed in the experiment may be either a reference
(if it has associated probe intensity data as in the chip wild-type) or a sample sequence. The windows are aligned and
a rectangular box 1008 shows the selected bases' position in each of the sequences (position 162 in Fig. 18). The
rectangular box aids the user in identifying the selected bases.

Referring again to Fig. 16, in pull down menu Highlight 810, the user is able to compare the sequences of references
and samples. At least four comparisons are available to the user, including the following: sample sequences to the chip
wild-type sequence, sample sequences to any reference sequences, sample sequences to any linked reference
sequences, and reference sequences to the chip wild-type sequence. For example, after the user has linked a reference
and sample sequence, the user can compare the bases in the linked sequences. Bases in the sample sequence that
are different from the reference sequence will then be indicated on the display device to the user (e.g., base is shown
in a different color). In another example, the user is able to perform a comparison that will help identify sample sequences.
After a sample is linked to multiple reference sequences, each base in the sample sequence that does not match the
wild-type sequence is checked to see if it matches one of the linked reference sequences. The bases that match a linked
reference sequence will then be indicated on the display device to the user. The user may then more easily identify the
sample sequence as being one of the reference sequences.
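
The second Highlight comparison can be illustrated with a short Python sketch. It is a simplified reading of the paragraph above, not the program's actual logic: the function names, the 0-based positions, the exact character matching (ambiguity codes are not expanded), and the example sequences and reference names are all assumptions made for illustration.

```python
def mismatches(sample: str, reference: str) -> list[int]:
    """Return the 0-based positions where the sample differs from the reference."""
    return [i for i, (s, r) in enumerate(zip(sample, reference)) if s != r]


def matching_linked_references(sample: str, wild_type: str,
                               linked: dict[str, str]) -> dict[int, list[str]]:
    """For each sample base that differs from the wild type, list the linked
    reference sequences that carry the same base at that position."""
    result = {}
    for pos in mismatches(sample, wild_type):
        result[pos] = [name for name, seq in linked.items()
                       if pos < len(seq) and seq[pos] == sample[pos]]
    return result


if __name__ == "__main__":
    wild_type = "ATGTGGACAGTTGTA"          # hypothetical chip wild-type
    sample = "ATGTGGATAGTTGTA"             # hypothetical called sample
    linked = {"refA": "ATGTGGATAGTTGTA",   # hypothetical linked references
              "refB": "ATGTGGAGAGTTGTA"}
    print(matching_linked_references(sample, wild_type, linked))
```

In this made-up case the only mismatch with the wild type is at position 7, and only `refA` carries the same base there, which is the kind of hint the display gives the user.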

In pull down menu Help 812, the user is able to get information and instructions regarding the comparative analysis
program, the calling methods, and the IUPAC definitions used in the program.

Fig. 19 illustrates the intensity ratio method correctly calling a mutation in solutions with varying concentrations (SEQ
ID NO:10, SEQ ID NO:11, SEQ ID NO:12, SEQ ID NO:13, SEQ ID NO:14, SEQ ID NO:15, SEQ ID NO:16, SEQ ID
NO:17, and SEQ ID NO:18). A window 1102 is shown with a chip wild-type 1104 and a mutant sequence 1106. The
mutant sequence differs from the chip wild-type at the position indicated by the rectangular box 1108. The chip wild-type
and mutant sequences are a region of HIV Pol Gene spanning mutations occurring in AZT drug therapy.

There are seven sample sequences that are called using the intensity ratio method. The sample sequences are
actually solutions of different proportions of the chip wild-type sequence and the mutant sequence. Thus, there are
sample solutions 1110, 1112, 1114, 1116, 1118, 1120, and 1122. The solutions are 15-mer tilings across the chip wild-type
with increased percentages of the mutant sequence from 0 to 100% by weight. The following shows the proportions
of the sample solutions:

Sample Solution    Chip Wild-Type:Mutant
1110               100:0
1112               90:10
1114               75:25
1116               50:50
1118               25:75
1120               10:90
1122               0:100

For example, sample solution 1114 contains 75% chip wild-type sequence and 25% mutant sequence.

Now referring to the bases called in rectangular box 1108 for the sample solutions, the intensity ratio method correctly
calls sample solution 1110 as having a base A as in the chip wild-type sequence. This is correct because sample solution
1110 is 100% chip wild-type sequence. The intensity ratio method also calls sample solution 1112 as having a base A
because the sample solution is 90% chip wild-type sequence.

The intensity ratio method calls the identified base in sample solutions 1114 and 1116 as being an R, which is an
IUPAC ambiguity code denoting A or G (purine). This is also a correct base call because the sample solutions have from
75% to 50% chip wild-type sequence and from 25% to 50% mutation sequence. Thus, the intensity ratio method correctly
calls the base in this transition state.

Sample solutions 1118, 1120, and 1122 are called by the intensity ratio method as having a mutation base G at the
specified location. This is a correct base call because the sample solutions primarily consist of the mutation sequence
(75%, 90%, and 100% respectively). Again, the intensity ratio method correctly called the bases.

These experiments also show that the base calling methods of the present invention may also be used for solutions
of more than one nucleic acid sequence.
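
How a ratio test can yield A, R, or G across such a dilution series can be sketched as follows. This is an illustrative reading, not the patented procedure: the cutoff of 1.2 is the "approximately 1.2" value recited in the claims of this document, but the walk-down rule that merges bases whose intensities are within the cutoff of each other, the function name `call_base`, and all intensity values are assumptions. The code table is the standard IUPAC nucleotide ambiguity set.

```python
# Standard IUPAC nucleotide codes mapped to the bases they denote.
_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}
IUPAC_BY_SET = {frozenset(bases): code for code, bases in _CODES.items()}


def call_base(intensities: dict[str, float], ratio_cutoff: float = 1.2) -> str:
    """Call a base from four background-subtracted probe intensities.

    Assumed rule: rank the probes by intensity and keep adding the next base
    while the ratio of the higher to the lower intensity stays at or below
    the cutoff; the kept set is reported as an IUPAC code.
    """
    ranked = sorted(intensities.items(), key=lambda item: item[1], reverse=True)
    called = [ranked[0][0]]
    for (_, higher), (base, lower) in zip(ranked, ranked[1:]):
        if lower <= 0 or higher / lower > ratio_cutoff:
            break  # the brighter probe clearly dominates; stop merging
        called.append(base)
    return IUPAC_BY_SET[frozenset(called)]


if __name__ == "__main__":
    # Hypothetical intensities for a mostly wild-type, mixed, and mostly mutant sample.
    print(call_base({"A": 1000, "C": 100, "G": 120, "T": 90}))
    print(call_base({"A": 500, "C": 60, "G": 480, "T": 50}))
    print(call_base({"A": 90, "C": 60, "G": 900, "T": 50}))
```

With these made-up numbers the three calls are A, R, and G in turn, mirroring the pattern reported for the sample solutions; where the A-to-R and R-to-G transitions fall in a real mixture depends on the hybridization behavior of the probes.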

Fig. 20 illustrates the reference method correctly calling a mutant base where the intensity ratio method incorrectly
called the mutant base (SEQ ID NO:36, SEQ ID NO:37, SEQ ID NO:38, and SEQ ID NO:39). There are three intensity
graph windows 1202, 1204, and 1206 as shown. The windows are aligned and a rectangular box 1208 outlines the
bases of interest. Window 1202 shows a sample sequence called using the intensity ratio method. However, the base
in the rectangular box 1208 was incorrectly called base C, as there is actually a base A at that position. The intensity
ratio method incorrectly called the base as C because the probe intensity associated with base C is much higher than
the other probe intensities.

Window 1204 shows a reference sequence called using the intensity ratio method. As the reference sequence is
known, it is not necessary to know the method used to call the reference sequence. However, it is important to have
probe intensities for a reference sequence to use the reference method. The reference sequence is called a base C at
the position indicated by the rectangular box.

Window 1206 shows the sample sequence called using the reference method. The reference method correctly calls
the specified base as being base A. Thus, for some cases the reference method is preferable to the intensity ratio method
because it compares probe intensities of a sample sequence to probe intensities of a reference sequence.
VII. Examples

Example 1

The intensity ratio method was used in sequence analysis of various polymorphic HIV-1 clones using a protease
chip. Single stranded DNA of a 382 nt region was used with 4 different clones (HXB2, SF2, NY5, pPol4mut18). Results
were compared to results from an ABI sequencer. The results are illustrated below:





                     ABI                Protease Chip
              Sense   Antisense      Sense   Antisense
No call         0         4            9         4
Ambiguous       6        14           17         8
Wrong call      2         3            3         1
TOTAL           8        21           29        13


SUMMARY 
ABI (sense) - 99.5% 
Chip (sense) - 98.1% 
ABI (antisense) - 98.6% 
Chip (antisense) - 99.1% 
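
The SUMMARY percentages appear to follow from the TOTAL counts if each figure is read as 1 − total/(382 × 4), that is, a 382 nt region for each of the 4 clones. That denominator is an inference from the numbers, not something the text states, and the names in the sketch below are hypothetical.

```python
REGION_LENGTH_NT = 382   # length of the analyzed region, from Example 1
NUM_CLONES = 4           # number of clones, from Example 1

# TOTAL counts (no call + ambiguous + wrong call) per method and strand.
TOTAL_ERRORS = {
    ("ABI", "sense"): 8,
    ("ABI", "antisense"): 21,
    ("Chip", "sense"): 29,
    ("Chip", "antisense"): 13,
}


def accuracy_percent(total_errors: int,
                     bases: int = REGION_LENGTH_NT * NUM_CLONES) -> float:
    """Percentage of bases called unambiguously and correctly (assumed definition)."""
    return round(100.0 * (1.0 - total_errors / bases), 1)


if __name__ == "__main__":
    for (method, strand), errors in TOTAL_ERRORS.items():
        print(f"{method} ({strand}) - {accuracy_percent(errors)}%")
```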



Example 2

HIV protease genotyping was performed using the described chips and CallSeq™ intensity ratio calculations. Samples
were evaluated from AIDS patients before and after ddI treatment. Results were confirmed with ABI sequencing.

Fig. 21 illustrates the output of the ViewSeq™ program with four pretreatment samples and four posttreatment samples
(SEQ ID NO:22, SEQ ID NO:23, SEQ ID NO:24, SEQ ID NO:25, SEQ ID NO:26, and SEQ ID NO:27). Note the
base change at position 207 where a mutation has arisen. Even adjacent to two additional mutations (gt), the "a" mutation
has been properly detected.

The above description is illustrative and not restrictive. Many variations of the invention will become apparent to
those of skill in the art upon review of this disclosure. Merely by way of example, while the invention is illustrated with
particular reference to the evaluation of DNA (natural or unnatural), the methods can be used in the analysis from chips
with other materials synthesized thereon, such as RNA. The scope of the invention should, therefore, be determined
not with reference to the above description, but instead should be determined with reference to the appended claims
along with their full scope of equivalents.

SEQUENCE LISTING



(1) GENERAL INFORMATION: 

(i) APPLICANT: 

(A) NAME: Affymax Technologies N.V. 

(B) STREET: De Ruyderkade 62 

(C) CITY: Curacao

(E) COUNTRY: Netherlands Antilles

(F) POSTAL CODE (ZIP): none

(ii) TITLE OF INVENTION: Computer-Aided Visualization and 
Analysis System for Sequence Evaluation 

(iii) NUMBER OF SEQUENCES: 39 

(iv) COMPUTER READABLE FORM: 

(A) MEDIUM TYPE: Floppy disk 

(B) COMPUTER: IBM PC compatible 

(C) OPERATING SYSTEM: PC-DOS/MS-DOS

(D) SOFTWARE: PatentIn Release #1.0, Version #1.25 (EPO)



(2) INFORMATION FOR SEQ ID NO: 1:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 15 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 1:
ATGTGGACAG TTGTA 15

(2) INFORMATION FOR SEQ ID NO: 2:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 15 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 2: 
ATGTGGATAG TTGTA 



(2) INFORMATION FOR SEQ ID NO: 3: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 15 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 3: 
ATGTGGAKAG TTGTA

(2) INFORMATION FOR SEQ ID NO: 4:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 4:
AAAACTGAAA A 



(2) INFORMATION FOR SEQ ID NO: 5: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 5: 
AAAACCGAAA A

(2) INFORMATION FOR SEQ ID NO: 6:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 17 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 6: 
AAACCCAATC CACATCA 



(2) INFORMATION FOR SEQ ID NO: 7: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 17 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 7: 
AAACCCAGTC CACATCA

(2) INFORMATION FOR SEQ ID NO: 8:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 31 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 8: 
GGGGAAGCAG ATTTGGGTAC CACCCAAGTA T 



(2) INFORMATION FOR SEQ ID NO: 9: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 31 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 9: 
GGGGAAGCAG ATTTGAAMAC CACCCAAGTA T

(2) INFORMATION FOR SEQ ID NO: 10:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 10: 

GCATTAGTAG AGATATGTAC AGAAATGGAA AAGGAAGGGA AAATTTCAAA 
AATTGGGCC 



(2) INFORMATION FOR SEQ ID NO: 11: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 11: 

GCATTAGTAG AAATTTGTAC AGAGATGGAA AAGGAAGGGA AAATTTCAAA 
AATTGGGCC

(2) INFORMATION FOR SEQ ID NO: 12:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 12: 

GCATTAGTAG AGATATGGAG AGRARDGGRA AXXXAAGGGA AAATTNNNAA 
AATTGGGCC 



(2) INFORMATION FOR SEQ ID NO: 13: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 13: 

GCATTAGTAG AGATATGKAS AGRARDGGRA AXXXAAGGGA AAAKTNNNAA 
AATTGGGCC

(2) INFORMATION FOR SEQ ID NO: 14:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 14: 

GCATTAGTAG AGATATGKAS AGRRRDGGRA AXXXAAGGGA AAADTYNNAA 
AATTGGGCC 



(2) INFORMATION FOR SEQ ID NO: 15: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 15: 

GCATTAGTAG AGATATGTAS AGRRADGGAA AXGGAAGGGA AAATTNNNNA 
AATTGGGCC

(2) INFORMATION FOR SEQ ID NO: 16:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 16:

GCATTAGTAG AGATATGTAC AGRGAGGGAA AXGGAAGGGA AAATTNNNNA 
AATTGGGCC 



(2) INFORMATION FOR SEQ ID NO: 17: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 17: 

GCATTAGTAG AGATATGTAS AGRGAGGGAA AXGGAAGGGA AAATTNNNNA 
AATTGGGCC

(2) INFORMATION FOR SEQ ID NO: 18:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 18: 

GCATTAGTAG GAGGNNNGAC AGGGRKGGAA AXXMAAGGGA AAAKTNNNAA 
AATTGGGCC 



(2) INFORMATION FOR SEQ ID NO: 19: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 19: 

TCGAGATAAT CTATGTCCTC GTCTACTATG TCATAATCTT CTTTACTTAA 
ACGGTCCTTT 

TACCTTTGGT TTTTACTATC CCCCTTAACC TCCAAAATAG TTTCATTCTG 
TCATGCTAGT 

CTATGGACAT CTTTAGACAC CTGTATTTCG ATATCCATGT

(2) INFORMATION FOR SEQ ID NO: 20:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 20:

NNGAGATANN NTATGTCCTC GTCYACTATG TNANNNNNNN NNNNNNNNAA
ACGGTCCTNN 60

NNNNNNNNNN NNNNNNNNNN CNNCNTAACC TCCAAAATAN NNNNNNTCTN
NNNNANNNNT 120

CTANNNGNAG NNNNAGANAR NCCNNNNNNN NNATNCATGT 160



(2) INFORMATION FOR SEQ ID NO: 21: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA (oligonucleotide)

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 21:

TCGAGATAAT CTATGTCCTC GTCTACTATG TCATAATNNN NNNNACTTAA
ACGGTCCTTT 60

TACCTTTGGT TTTTACTATC CCCCTTAACC TCCAAAATAG TTTCATTCTG
NCATANNAGT 120

CTATGNGNNG NNNTAGACAG NCCNNNNTCG ATATCCATGT 160

(2) INFORMATION FOR SEQ ID NO: 22:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 22: 

TCGAGATAAT CTATGTCCTC GTCTACTATG TCATAATCTT CTTTACTTAA 
ACGGTCCTTT 

TACCTTTGGT TTTTACTATC CNNCTTAACC TCCAAAATAG TTTCATTCTG 
TCATACTAGT 

CTATGGGTAG CTTTAGACCN CCGTATTTCG ATATCCATGT 



(2) INFORMATION FOR SEQ ID NO: 23: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 23: 

TCGAGATAAT CTATGTCCTC GTCTACTATG TCATAATCTT CTTTACTTAA 
ACGGTCCTTT 

TACCTTTGGT TTTTACTATC CCNCTTAACC TCCAAAATAG TTTCATTCTG 
TCATACTAGT 

CTATGGGTAG CTTTAGACCC CCGTATTTCG ATATCCATGT



(2) INFORMATION FOR SEQ ID NO: 24: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 24:

NCGGGATANT NTATGTCCTC GTCYACTATG TCANNNNNCN NNCNNNNCAA
ACGGTCCNCC 60

NNNNNCNNNN NNCNNCYANG AANCYCAACC TCCAAAATAN NNNNNNTCTN
NNNNANNNCN 120

CTNNNNNNAG NGNNAGACAC CTGTATNNNN NTATNCAYGT 160



(2) INFORMATION FOR SEQ ID NO: 25:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA (oligonucleotide)

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 25:

TCGRGATAAT CTATGTCCTC GTCTACTATG TCATAATCCN NNCNNCTCAA
ACGGTCCTYC 60

CNNNNYTGGT TNYTACTATC CCCCTTAACC TCCAAAATAG TTTCATTCTG
NCATACNNST 120

CTANNNNNAG NGTTAGACAC CTGTATTTCG ATATCCATGT 160

(2) INFORMATION FOR SEQ ID NO: 26:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 26: 

TCGAGATAAT CTATGTCCTC GTCTACTATG TCATAATCCN NCCTACTCAA 
ACGGTCCTTC 

TACCTTTGGT TTTTACTATC CMCCTTAACC TCCAAAATAG TTTCATTCTG 
TCATACTAGT 

CTATGAGTAG CTTTAGACAC CTGTATTTCG ATATCCATGT 



(2) INFORMATION FOR SEQ ID NO: 27: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 27: 

TCGAGATAAT CTATGTCCTC GTCTACTATG TCATAATCTT CTTTACYCAA 
ACGGTCCTXC 

TACCTTTGGT TTTTACTATC CCMCTTAACC TCCAAAATAG TTTCATTCTG 
TCATACTAGT 

CTATGAGTAG CTTTAGACAC CTGTATTTCG ATATCCATGT

(2) INFORMATION FOR SEQ ID NO: 28:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 17 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 28: 
AAACCCAATC CACATCM 



(2) INFORMATION FOR SEQ ID NO: 29: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 17 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO:29: 
MMACNCANNC CACANNM 






EP 0 717 113 A2 



(2) INFORMATION FOR SEQ ID NO: 30: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 30: 
TTGGGTACCA C 



(2) INFORMATION FOR SEQ ID NO: 31: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 31: 
TTGAAMACCA C

(2) INFORMATION FOR SEQ ID NO: 32:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 32: 
ACAGAAATGG A 



(2) INFORMATION FOR SEQ ID NO: 33: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 33: 
AGAGRATDGG R

(2) INFORMATION FOR SEQ ID NO: 34:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 34: 
ASAGRRADGG A 



(2) INFORMATION FOR SEQ ID NO: 35: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 35: 
ACAGGGRRGG A

(2) INFORMATION FOR SEQ ID NO: 36:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 36: 
CTGGGGGGTA T 



(2) INFORMATION FOR SEQ ID NO: 37: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 37: 
CTGGCCSGTG T 



(2) INFORMATION FOR SEQ ID NO: 38: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 38: 
CTGGGCGGTA T

(2) INFORMATION FOR SEQ ID NO: 38:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide)



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 39: 
CTGGCACGTG T 11 



Claims 

1. In a computer system, a method of identifying an unknown base in a sample nucleic acid sequence, said method
comprising the steps of:

inputting a plurality of probe intensities, each of said probe intensities being associated with a nucleic acid
probe;

said computer system comparing said plurality of probe intensities wherein each of said plurality of probe
intensities is substantially proportional to said associated nucleic acid probe hybridizing with at least one nucleic
acid sequence, said at least one nucleic acid sequence including said sample sequence; and

calling said unknown base according to results of said comparing step.

2. In a computer system, a method of identifying an unknown base in a sample nucleic acid sequence, said method
comprising the steps of:

inputting a plurality of probe intensities, each of said probe intensities being associated with a nucleic acid
probe;

said computer system comparing said plurality of probe intensities wherein each of said plurality of probe
intensities is substantially proportional to said associated nucleic acid probe hybridizing with said sample sequence;
and

calling said unknown base according to results of said comparing step.

3. The method of claim 2, wherein said comparing step includes the step of said computer system calculating a ratio
of a higher probe intensity to a lower probe intensity.

4. The method of claim 3, wherein said calling step includes the step of calling said unknown base according to said
probe associated with said higher probe intensity if said ratio is greater than a predetermined ratio value.

5. The method of claim 4, wherein said predetermined ratio value is approximately 1.2.

6. In a computer system, a method of identifying an unknown base in a sample nucleic acid sequence, said method
comprising the steps of:

inputting a first set of probe intensities, each of said probe intensities in said first set being associated with
a nucleic acid probe and substantially proportional to said associated nucleic acid probe hybridizing with a reference
nucleic acid sequence;

inputting a second set of probe intensities, each of said probe intensities in said second set being associated
with a nucleic acid probe and substantially proportional to said associated nucleic acid probe hybridizing with said
sample sequence;

said computer system comparing at least one of said probe intensities in said first set and at least one of
said probe intensities in said second set; and

calling said unknown base according to results of said comparing step.

7. The method of claim 6, wherein said comparing step includes the steps of: 

calculating first ratios of a wild-type probe intensity to each probe intensity of a probe hybridizing with said 
reference sequence, wherein said wild-type probe intensity is associated with a wild-type probe; and 

calculating second ratios of the highest probe intensity of a probe hybridizing with said sample sequence to 
each probe intensity of a probe hybridizing with said sample sequence. 

8. The method of claim 7, wherein said comparing step further includes the step of calculating third ratios of said first
ratios to said second ratios.

9. The method of claim 8, wherein said calling step includes the step of calling said unknown base according to said
probe associated with a highest third ratio.

10. The method of claim 6, wherein said comparing step includes the step of calculating a ratio of a highest probe 
intensity in said first set to a highest intensity in said second set. 

11. The method of claim 10, wherein said comparing step further includes the step of comparing said ratio of neighboring
nucleic acid probes.

12. In a computer system, a method of identifying an unknown base in a sample nucleic acid sequence, said method
comprising the steps of:

inputting statistics about a plurality of experiments, each of said experiments producing probe intensities
each being associated with a nucleic acid probe and substantially proportional to said associated nucleic acid probe
hybridizing with a reference nucleic acid sequence;

inputting a plurality of probe intensities, each of said plurality of probe intensities being associated with a
nucleic acid probe and substantially proportional to said associated nucleic acid probe hybridizing with said sample
sequence;

said computer system comparing at least one of said plurality of probe intensities with said statistics; and

calling said unknown base according to results of said comparing step.

13. The method of claim 12, further comprising the step of calculating said statistics.

14. The method of claim 12, wherein said statistics include a mean and standard deviation.

15. A method of processing first and second nucleic acid sequences, comprising the steps of:

providing a plurality of nucleic acid probes;

labeling said first nucleic acid sequence with a first marker;

labeling said second nucleic acid sequence with a second marker; and

hybridizing said first and second labeled nucleic acid sequences at the same time.

16. The method of claim 15, wherein said plurality of nucleic acid probes are on a chip.

17. The method of claim 15, further comprising the step of fragmenting said first and second nucleic acid sequences
at the same time.

18. The method of claim 15, further comprising the step of scanning for said first and second markers on said chip, said
first and second labeled nucleic acid sequences being on said chip.

19. The method of claim 15, wherein said first and second markers are fluorescent markers that emit light at different
wavelengths upon excitation.

20. In a computer system, a method of identifying mutations in a sample nucleic acid sequence, said method comprising 
the steps of: 

inputting a first set of probe intensities, each of said probe intensities in said first set being associated with
a nucleic acid probe and substantially proportional to said associated nucleic acid probe hybridizing with a reference
nucleic acid sequence;

inputting a second set of probe intensities, each of said probe intensities in said second set being associated
with a nucleic acid probe and substantially proportional to said associated nucleic acid probe hybridizing with said
sample sequence;

said computer system comparing probe intensities in said first set and probe intensities in said second set
to select hybridization regions where said probe intensities in said first set and said probe intensities in said second
set differ; and

identifying mutations according to characteristics of said selected regions. 

21. The method of claim 20, wherein said selected regions are determined by comparing probe intensities of wild-type
probes.

22. The method of claim 21, wherein said wild-type probes are complementary to a portion of said reference sequence.

23. The method of claim 21, wherein said identifying step further includes the steps of:

analyzing a size of a selected region;

identifying a likely position of a mutation in said selected region according to an interrogation position of said
nucleic acid probes; and

performing base calling at said likely position.

24. In a computer system, a method of analyzing a plurality of sequences of bases, said plurality of sequences including
at least one reference sequence and at least one sample sequence, the method comprising the steps of: 
displaying said at least one reference sequence in a first area on a display device; and 
displaying said at least one sample sequence in a second area on said display device; 
whereby a user is capable of visually comparing said plurality of sequences. 

25. The method of claim 24, wherein said plurality of sequences are monomer strands of DNA or RNA.

26. The method of claim 24, wherein said at least one reference sequence includes a chip wild-type that has been tiled
on a chip. 

27. The method of claim 26, wherein said chip wild-type sequence is displayed as a first sequence in said first area. 

28. The method of claim 26, further comprising the step of displaying a label in said first area to identify said chip
wild-type sequence.

29. The method of claim 24, wherein said at least one sample sequence has been hybridized on a chip.

30. The method of claim 24, further comprising the step of indicating bases that differ among a plurality of user selected
sequences.

31. The method of claim 24, further comprising the steps of:

displaying a name associated with each of said at least one reference sequence in said first area; and 
displaying a name associated with each of said at least one sample sequence in said second area. 

32. The method of claim 24, further comprising the step of linking at least one reference sequence in said first area with
at least one sample sequence in said second area. 

33. The method of claim 32, further comprising the step of indicating on said display device which sequences are linked.

34. The method of claim 24, further comprising the step of indicating bases of said at least one sample sequence that
are not equal to a corresponding base in said at least one reference sequence. 

35. The method of claim 24, wherein said at least one reference sequence and said at least one sample sequence are 
aligned on said display device, hybridization with said probes. 
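
The ratio-based calling steps recited in claims 3-5 and 7-9 can be sketched as follows. This is an illustrative sketch only, not the claimed implementation: the function names, the use of the runner-up intensity as the "lower" intensity, the "N" symbol for an uncalled base, and the assumption that all intensities entering a division are strictly positive are choices made for the example.

```python
BASES = "ACGT"


def call_by_ratio(sample, min_ratio=1.2):
    """Claims 3-5 (sketch): call the brightest probe only if it exceeds the
    next-brightest probe by more than min_ratio; otherwise return 'N'."""
    ranked = sorted(BASES, key=lambda b: sample[b], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if sample[runner_up] <= 0:
        # No competing signal: call the best probe if it has any signal at all.
        return best if sample[best] > 0 else "N"
    return best if sample[best] / sample[runner_up] > min_ratio else "N"


def call_by_third_ratio(reference, sample, wild_type):
    """Claims 7-9 (sketch): call the probe with the highest third ratio.

    reference / sample: dicts mapping probe base -> intensity (all > 0).
    wild_type: base of the wild-type probe in the reference experiment.
    """
    highest_sample = max(sample[b] for b in BASES)
    third = {}
    for b in BASES:
        first = reference[wild_type] / reference[b]   # first ratio
        second = highest_sample / sample[b]           # second ratio
        third[b] = first / second                     # third ratio
    return max(BASES, key=lambda b: third[b])
```

For example, `call_by_ratio({"A": 120, "C": 40, "G": 35, "T": 30})` returns "A", because 120/40 exceeds 1.2. When the sample and reference hybridization patterns agree, all third ratios are close to 1 and the maximum carries little information, so the second function is only meaningful in this sketch when the sample differs from the reference.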

ARRAYS OF NUCLEIC ACID PROBES ON BIOLOGICAL CHIPS

Cross-Reference to Related Application

This application is a continuation-in-part of USSN 
08/284,064, filed August 2, 1994, which is a continuation-in- 
part of USSN 08/143,312, filed October 26, 1993, each of which 
is incorporated by reference in its entirety for all purposes.
Research leading to the invention was funded in part by NIH
grant No. 1R01HG00813-01, and the government may have certain
rights to the invention. 

Background of the Invention 
Field of the Invention

The present invention provides arrays of oligonucleotide
probes immobilized in microfabricated patterns on silica chips
for analyzing molecular interactions of biological interest.
The invention therefore relates to diverse fields impacted by
the nature of molecular interaction, including chemistry,
biology, medicine, and medical diagnostics. 

Description of Related Art 

Oligonucleotide probes have long been used to detect
complementary nucleic acid sequences in a nucleic acid of
interest (the "target" nucleic acid). In some assay formats,
the oligonucleotide probe is tethered, i.e., by covalent
attachment, to a solid support, and arrays of oligonucleotide
probes immobilized on solid supports have been used to detect
specific nucleic acid sequences in a target nucleic acid.
See, e.g., PCT patent publication Nos. WO 89/10977 and
89/11548. Others have proposed the use of large numbers of
oligonucleotide probes to provide the complete nucleic acid
sequence of a target nucleic acid but failed to provide an
enabling method for using arrays of immobilized probes for
this purpose. See U.S. Patent Nos. 5,202,231 and 5,002,867
and PCT patent publication No. WO 93/17126.

The development of VLSIPS™ technology has provided
methods for making very large arrays of oligonucleotide probes 
in very small arrays. See U.S. Patent No. 5,143,854 and PCT 
patent publication Nos. WO 90/15070 and 92/10092, each of 
which is incorporated herein by reference. U.S. Patent 
application Serial No. 082,937, filed June 25, 1993, describes 
methods for making arrays of oligonucleotide probes that can 
be used to provide the complete sequence of a target nucleic 
acid and to detect the presence of a nucleic acid containing a 
specific nucleotide sequence. 

Microfabricated arrays of large numbers of
oligonucleotide probes, called "DNA chips," offer great promise
for a wide variety of applications. New methods and reagents 
are required to realize this promise, and the present 
invention helps meet that need. 

SUMMARY OF THE INVENTION 

The invention provides several strategies employing 
immobilized arrays of probes for comparing a reference 
sequence of known sequence with a target sequence showing 
substantial similarity with the reference sequence, but 
differing in the presence of, e.g., mutations. In a first 
embodiment, the invention provides a tiling strategy employing 
an array of immobilized oligonucleotide probes comprising at 
least two sets of probes. A first probe set comprises a 
plurality of probes, each probe comprising a segment of at 
least three nucleotides exactly complementary to a subsequence 
of the reference sequence, the segment including at least one 
interrogation position complementary to a corresponding 
nucleotide in the reference sequence. A second probe set 
comprises a corresponding probe for each probe in the first 
probe set, the corresponding probe in the second probe set 
being identical to a sequence comprising the corresponding 
probe from the first probe set or a subsequence of at least 
three nucleotides thereof that includes the at least one 
interrogation position, except that the at least one 
interrogation position is occupied by a different nucleotide 
in each of the two corresponding probes from the first and 
second probe sets. The probes in the first probe set have at 
least two interrogation positions corresponding to two
contiguous nucleotides in the reference sequence. One
interrogation position corresponds to one of the contiguous
nucleotides, and the other interrogation position to the
other.

In a second embodiment, the invention provides a tiling
strategy employing an array comprising four probe sets. A
first probe set comprises a plurality of probes, each probe
comprising a segment of at least three nucleotides exactly
complementary to a subsequence of the reference sequence, the
segment including at least one interrogation position
complementary to a corresponding nucleotide in the reference
sequence. Second, third and fourth probe sets each comprise a
corresponding probe for each probe in the first probe set.
The probes in the second, third and fourth probe sets are
identical to a sequence comprising the corresponding probe
from the first probe set or a subsequence of at least three
nucleotides thereof that includes the at least one
interrogation position, except that the at least one
interrogation position is occupied by a different nucleotide
in each of the four corresponding probes from the four probe
sets. The first probe set often has at least 100
interrogation positions corresponding to 100 contiguous
nucleotides in the reference sequence. Sometimes the first
probe set has an interrogation position corresponding to every
nucleotide in the reference sequence. The segment of
complementarity within the probe set is usually about 9-21
nucleotides. Although probes may contain leading or trailing
sequences in addition to the 9-21 sequences, many probes
consist exclusively of a 9-21 segment of complementarity.
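
The four-probe-set tiling described above can be sketched in code. The following is a minimal illustration under stated assumptions, not the procedure used to design any actual chip: the function name `tile`, the default probe length of 15 with the interrogation position at index 7, and the convention of writing each probe so that its i-th character pairs with the i-th character of the reference window (that is, without reversing the strand) are all choices made for the example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def tile(reference, probe_length=15, interrogation_index=7):
    """Return one column of four probes per interrogated reference nucleotide."""
    columns = []
    for start in range(len(reference) - probe_length + 1):
        window = reference[start:start + probe_length]
        # Probe from the first set: exact complement of the window,
        # written so that index i pairs with window[i].
        wild_type = "".join(COMPLEMENT[n] for n in window)
        probes = {}
        for base in "ACGT":
            # The four corresponding probes differ only at the
            # interrogation position; one of them equals wild_type.
            probes[base] = (wild_type[:interrogation_index] + base
                            + wild_type[interrogation_index + 1:])
        columns.append({
            "ref_position": start + interrogation_index,
            "wild_type_base": wild_type[interrogation_index],
            "probes": probes,
        })
    return columns
```

Successive columns shift the window by one nucleotide, so each reference position far enough from the ends of the sequence is interrogated once. Exactly one probe per column is perfectly complementary to the reference.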

In a third embodiment, the invention provides immobilized
arrays of probes tiled for multiple reference sequences. One
such array comprises at least one pair of first and second
probe groups, each group comprising first and second sets of
probes as defined in the first embodiment. Each probe in the
first probe set from the first group is exactly complementary
to a subsequence of a first reference sequence, and each probe
in the first probe set from the second group is exactly
complementary to a subsequence of a second reference sequence.
Thus, the first group of probes is tiled with respect to a
first reference sequence and the second group of probes with
respect to a second reference sequence. Each group of probes
can also include third and fourth sets of probes as defined in
the second embodiment. In some arrays of this type, the
second reference sequence is a mutated form of the first
reference sequence.

In a fourth embodiment, the invention provides arrays for
block tiling. Block tiling is a species of the general tiling
strategies described above. The usual unit of a block tiling
array is a group of probes comprising a wildtype probe, a
first set of three mutant probes and a second set of three
mutant probes. The wildtype probe comprises a segment of at
least three nucleotides exactly complementary to a subsequence
of a reference sequence. The segment has at least first and
second interrogation positions corresponding to first and
second nucleotides in the reference sequence. The probes in
the first set of three mutant probes are each identical to a
sequence comprising the wildtype probe or a subsequence of at
least three nucleotides thereof including the first and second
interrogation positions, except in the first interrogation
position, which is occupied by a different nucleotide in each
of the three mutant probes and the wildtype probe. The probes
in the second set of three mutant probes are each identical to
a sequence comprising the wildtype probe or a subsequence of
at least three nucleotides thereof including the first and
second interrogation positions, except in the second
interrogation position, which is occupied by a different
nucleotide in each of the three mutant probes and the wildtype
probe.
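
A block-tiling unit as described above can be sketched as one wildtype probe plus one set of three mutant probes per interrogation position. The function name `block_unit` and the zero-based indexing are assumptions made for this illustration; the sketch keeps the mutants full length and does not model mutant probes that are subsequences of the wildtype probe.

```python
def block_unit(wild_type_probe, interrogation_indices):
    """Build one block: a shared wildtype probe and, for each interrogation
    position, the three probes that differ from it at that position only."""
    mutant_sets = []
    for i in interrogation_indices:
        mutants = [
            wild_type_probe[:i] + base + wild_type_probe[i + 1:]
            for base in "ACGT"
            if base != wild_type_probe[i]
        ]
        mutant_sets.append(mutants)
    return {"wild_type": wild_type_probe, "mutant_sets": mutant_sets}
```

Because the wildtype probe is shared, a block with k interrogation positions needs 1 + 3k probes, compared with 4k probes when each position is given its own column of four.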

In a fifth embodiment, the invention provides methods of
comparing a target sequence with a reference sequence using
arrays of immobilized pooled probes. The arrays employed in
these methods represent a further species of the general
tiling arrays noted above. In these methods, variants of a
reference sequence differing from the reference sequence in at
least one nucleotide are identified and each is assigned a
designation. An array of pooled probes is provided, with each
pool occupying a separate cell of the array. Each pool
comprises a probe comprising a segment exactly complementary
to each variant sequence assigned a particular designation.
The array is then contacted with a target sequence comprising
a variant of the reference sequence. The relative
hybridization intensities of the pools in the array to the
target sequence are determined. The identity of the target
sequence is deduced from the pattern of hybridization
intensities. Often, each variant is assigned a designation
having at least one digit and at least one value for the
digit. In this case, each pool comprises a probe comprising a
segment exactly complementary to each variant sequence
assigned a particular value in a particular digit. When
variants are assigned successive numbers in a numbering system
of base m having n digits, n x (m-1) pooled probes
are used to assign each variant a designation.
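
The digit-based pooling scheme can be illustrated with a short sketch. It rests on an assumption that is one reading of the n x (m-1) count rather than something stated in the text: a pool is provided for every nonzero value of every digit, and the value zero in a digit is inferred from the absence of signal in that digit's pools. The function names and the treatment of hybridization as a yes/no outcome are likewise assumptions made for the example.

```python
def to_digits(number, base, n_digits):
    """Digits of number in the given base, least significant digit first."""
    digits = []
    for _ in range(n_digits):
        digits.append(number % base)
        number //= base
    return digits


def build_pools(n_variants, base, n_digits):
    """Map (digit index, nonzero value) -> set of variant numbers in that pool."""
    if n_variants > base ** n_digits:
        raise ValueError("not enough digits to number every variant")
    pools = {(d, v): set() for d in range(n_digits) for v in range(1, base)}
    for variant in range(n_variants):
        for d, v in enumerate(to_digits(variant, base, n_digits)):
            if v != 0:  # assumption: value 0 has no pool of its own
                pools[(d, v)].add(variant)
    return pools


def decode(hybridized_pools, base, n_digits):
    """Recover the variant number from the set of pools that hybridized."""
    digits = [0] * n_digits
    for d, v in hybridized_pools:
        digits[d] = v
    return sum(v * base ** d for d, v in enumerate(digits))
```

With base 3 and two digits, nine variants are distinguished by 2 x (3-1) = 4 pools. Variant 5, written as "12" in base 3, would appear in the pool for value 2 of the low digit and the pool for value 1 of the high digit.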

In a sixth embodiment, the invention provides a pooled
probe for trellis tiling, a further species of the general
tiling strategy. In trellis tiling, the identity of a
nucleotide in a target sequence is determined from a
comparison of hybridization intensities of three pooled
trellis probes. A pooled trellis probe comprises a segment
exactly complementary to a subsequence of a reference sequence
except at a first interrogation position occupied by a pooled
nucleotide N, a second interrogation position occupied by a
pooled nucleotide selected from the group of three consisting
of (1) M or K, (2) R or Y and (3) S or W, and a third
interrogation position occupied by a second pooled nucleotide
selected from the group. The pooled nucleotide occupying the
second interrogation position comprises a nucleotide
complementary to a corresponding nucleotide from the reference
sequence when the second pooled probe and reference sequence
are maximally aligned, and the pooled nucleotide occupying the
third interrogation position comprises a nucleotide
complementary to a corresponding nucleotide from the reference
sequence when the third pooled probe and the reference
sequence are maximally aligned. Standard IUPAC nomenclature
is used for describing pooled nucleotides.

In trellis tiling, an array comprises at least first,
second and third cells, respectively occupied by first, second
and third pooled probes, each according to the generic
description above. However, the segment of complementarity,
location of interrogation positions, and selection of pooled
nucleotide at each interrogation position may or may not
differ between the three pooled probes subject to the
following constraint. One of the three interrogation
positions in each of the three pooled probes must align with
the same corresponding nucleotide in the reference sequence.
This interrogation position must be occupied by an N in one of
the pooled probes, and a different pooled nucleotide in each
of the other two pooled probes.
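
The way two pooled nucleotides can jointly identify a base may be illustrated as follows. This is a simplified sketch and not the analysis procedure of the invention: it treats hybridization of each pooled probe as a yes/no outcome (for instance judged relative to the probe carrying N at the same position), it assumes the two pooled nucleotides are drawn from different pairs among M/K, R/Y and S/W, and the function names are hypothetical. The IUPAC code table itself is standard.

```python
IUPAC = {
    "N": set("ACGT"),
    "M": set("AC"), "K": set("GT"),
    "R": set("AG"), "Y": set("CT"),
    "S": set("CG"), "W": set("AT"),
}
GROUPS = [{"M", "K"}, {"R", "Y"}, {"S", "W"}]
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def group_of(pool):
    """Index of the pair (M/K, R/Y or S/W) that a pooled nucleotide belongs to."""
    for index, group in enumerate(GROUPS):
        if pool in group:
            return index
    raise ValueError(f"{pool!r} is not a two-nucleotide pool used here")


def decode_trellis(pool_a, hybridized_a, pool_b, hybridized_b):
    """Infer the target nucleotide from two pooled interrogation positions."""
    if group_of(pool_a) == group_of(pool_b):
        raise ValueError("pools from the same pair cannot resolve four bases")
    candidates = set("ACGT")  # possible probe nucleotides pairing with the target
    for pool, hybridized in ((pool_a, hybridized_a), (pool_b, hybridized_b)):
        if hybridized:
            candidates &= IUPAC[pool]
        else:
            candidates -= IUPAC[pool]
    (probe_base,) = candidates  # exactly one candidate remains
    return COMPLEMENT[probe_base]
```

Two pools from different pairs always share exactly one nucleotide, so the two yes/no outcomes distinguish all four bases. For example, with M (A or C) and R (A or G), hybridization to both implies that the probe nucleotide A paired with the target, so the target nucleotide is T.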

In a seventh embodiment, the invention provides arrays
for bridge tiling. Bridge tiling is a species of the general
tiling strategies noted above, in which probes from the first
probe set contain more than one segment of complementarity.
In bridge tiling, a nucleotide in a reference sequence is
usually determined from a comparison of four probes. A first
probe comprises at least first and second segments, each of at
least three nucleotides and each exactly complementary to
first and second subsequences of a reference sequence. The
segments include at least one interrogation position
corresponding to a nucleotide in the reference sequence.
Either (1) the first and second subsequences are noncontiguous
in the reference sequence, or (2) the first and second
subsequences are contiguous and the first and second segments
are inverted relative to the first and second subsequences.
The array further comprises second, third and fourth probes,
which are identical to a sequence comprising the first probe
or a subsequence thereof comprising at least three nucleotides
from each of the first and second segments, except in the at
least one interrogation position, which differs in each of the
probes. In a species of bridge tiling, referred to as
deletion tiling, the first and second subsequences are
separated by one or two nucleotides in the reference sequence.

In an eighth embodiment, the invention provides arrays of
probes for multiplex tiling. Multiplex tiling is a strategy
in which the identity of two nucleotides in a target sequence
is determined from a comparison of the hybridization
intensities of four probes, each having two interrogation
positions. Each of the probes comprises a segment of at
least 7 nucleotides that is exactly complementary to a
subsequence from a reference sequence, except that the segment
may or may not be exactly complementary at two interrogation
positions. The nucleotides occupying the interrogation
positions are selected by the following rules: (1) the first
interrogation position is occupied by a different nucleotide
in each of the four probes, (2) the second interrogation
position is occupied by a different nucleotide in each of the
four probes, (3) in first and second probes, the segment is
exactly complementary to the subsequence, except at no more
than one of the interrogation positions, (4) in third and
fourth probes, the segment is exactly complementary to the
subsequence, except at both of the interrogation positions.
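
One assignment of nucleotides that satisfies rules (1)-(4) can be sketched as follows. The function name `multiplex_probes`, the zero-based indices, and the particular order in which mismatched nucleotides are distributed among the probes are assumptions made for this illustration; the rules admit other assignments.

```python
def multiplex_probes(complement_segment, first, second):
    """Return four probes with two interrogation positions each.

    complement_segment: segment exactly complementary to the reference
    subsequence; first / second: indices of the two interrogation positions.
    """
    wt_first = complement_segment[first]
    wt_second = complement_segment[second]
    others_first = [b for b in "ACGT" if b != wt_first]
    others_second = [b for b in "ACGT" if b != wt_second]
    # Probe 1 matches at the first position only, probe 2 at the second only,
    # and probes 3 and 4 are mismatched at both positions.
    firsts = [wt_first] + others_first
    seconds = [others_second[0], wt_second] + others_second[1:]
    probes = []
    for a, b in zip(firsts, seconds):
        bases = list(complement_segment)
        bases[first], bases[second] = a, b
        probes.append("".join(bases))
    return probes
```

Under rules (1) and (2) only one probe can carry the matching nucleotide at each interrogation position, and rule (4) excludes the third and fourth probes, so the first and second probes each end up with exactly one mismatch.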

In a ninth embodiment, the invention provides arrays of
immobilized probes including helper mutations. Helper
mutations are useful for, e.g., preventing self-annealing of
probes having inverted repeats. In this strategy, the
identity of a nucleotide in a target sequence is usually
determined from a comparison of four probes. A first probe
comprises a segment of at least 7 nucleotides exactly
complementary to a subsequence of a reference sequence except
at one or two positions, the segment including an
interrogation position not at the one or two positions. The
one or two positions are occupied by helper mutations.
Second, third and fourth mutant probes are each identical to a
sequence comprising the wildtype probe or a subsequence
thereof including the interrogation position and the one or
two positions, except in the interrogation position, which is
occupied by a different nucleotide in each of the four probes.

In a tenth embodiment, the invention provides arrays of 
probes comprising at least two probe sets, but lacking a probe 
set comprising probes that are perfectly matched to a 



PCT/US94/12305

WO 95/11995 

reference sequence. Such arrays are usually employed in
methods in which both reference and target sequence are
hybridized to the array. The first probe set comprises a
plurality of probes, each probe comprising a segment exactly
complementary to a subsequence of at least 3 nucleotides of a
reference sequence except at an interrogation position. The
second probe set comprises a corresponding probe for each
probe in the first probe set, the corresponding probe in the
second probe set being identical to a sequence comprising the
corresponding probe from the first probe set or a subsequence
of at least three nucleotides thereof that includes the
interrogation position, except that the interrogation position
is occupied by a different nucleotide in each of the two
corresponding probes and the complement to the reference
sequence.

In an eleventh embodiment, the invention provides methods
of comparing a target sequence with a reference sequence
comprising a predetermined sequence of nucleotides using any
of the arrays described above. The methods comprise
hybridizing the target nucleic acid to an array and
determining which probes, relative to one another, in the
array bind specifically to the target nucleic acid. The
relative specific binding of the probes indicates whether the
target sequence is the same or different from the reference
sequence. In some such methods, the target sequence has a
substituted nucleotide relative to the reference sequence in
at least one undetermined position, and the relative specific
binding of the probes indicates the location of the position
and the nucleotide occupying the position in the target
sequence. In some methods, a second target nucleic acid is
also hybridized to the array. The relative specific binding
of the probes then indicates both whether the target sequence
is the same or different from the reference sequence, and
whether the second target sequence is the same or different
from the reference sequence. In some methods, when the array
comprises two groups of probes tiled for first and second
reference sequences, respectively, the relative specific
binding of probes in the first group indicates whether the
target sequence is the same or different from the first
reference sequence. The relative specific binding of probes
in the second group indicates whether the target sequence is
the same or different from the second reference sequence.
Such methods are particularly useful for analyzing
heterologous alleles of a gene. Some methods entail
hybridizing both a reference sequence and a target sequence to
any of the arrays of probes described above. Comparison of
the relative specific binding of the probes to the reference
and target sequences indicates whether the target sequence is
the same or different from the reference sequence.

In a twelfth embodiment, the invention provides arrays of
immobilized probes in which the probes are designed to tile a
reference sequence from a human immunodeficiency virus.
Reference sequences from either the reverse transcriptase gene
or protease gene of HIV are of particular interest. Some
chips further comprise arrays of probes tiling a reference
sequence from a 16S RNA or DNA encoding the 16S RNA from a
pathogenic microorganism. The invention further provides
methods of using such arrays in analyzing an HIV target
sequence. The methods are particularly useful where the
target sequence has a substituted nucleotide relative to the
reference sequence in at least one position, the substitution
conferring resistance to a drug used in treating a patient
infected with an HIV virus. The methods reveal the existence
of the substituted nucleotide. The methods are also
particularly useful for analyzing a mixture of undetermined
proportions of first and second target sequences from
different HIV variants. The relative specific binding of
probes indicates the proportions of the first and second
target sequences.

In a thirteenth embodiment, the invention provides arrays
of probes tiled based on reference sequence from a CFTR gene.
A preferred array comprises at least a group of probes
comprising a wildtype probe, and five sets of three mutant
probes. The wildtype probe is exactly complementary to a
subsequence of a reference sequence from a cystic fibrosis
gene, the segment having at least five interrogation positions
corresponding to five contiguous nucleotides in the reference
sequence. The probes in the first set of three mutant probes
are each identical to the wildtype probe, except in a first of
the five interrogation positions, which is occupied by a
different nucleotide in each of the three mutant probes and
the wildtype probe. The probes in the second set of three
mutant probes are each identical to the wildtype probe, except
in a second of the five interrogation positions, which is
occupied by a different nucleotide in each of the three mutant
probes and the wildtype probe. The probes in the third set of
three mutant probes are each identical to the wildtype probe,
except in a third of the five interrogation positions, which
is occupied by a different nucleotide in each of the three
mutant probes and the wildtype probe. The probes in the
fourth set of three mutant probes are each identical to the
wildtype probe, except in a fourth of the five interrogation
positions, which is occupied by a different nucleotide in each
of the three mutant probes and the wildtype probe. The probes
in the fifth set of three mutant probes are each identical to
the wildtype probe, except in a fifth of the five
interrogation positions, which is occupied by a different
nucleotide in each of the three mutant probes and the wildtype
probe. Preferably, a chip comprises two such groups of
probes. The first group comprises a wildtype probe exactly
complementary to a first reference sequence, and the second
group comprises a wildtype probe exactly complementary to a
second reference sequence that is a mutated form of the first
reference sequence.

The invention further provides methods of using the
arrays of the invention for analyzing target sequences from a
CFTR gene. The methods are capable of simultaneously 
analyzing first and second target sequences representing 
heterozygous alleles of a CFTR gene. 

In a fourteenth embodiment, the invention provides arrays
of probes tiling a reference sequence from a p53 gene, an
hMLH1 gene and/or an MSH2 gene. The invention further
provides methods of using the arrays described above to
analyze these genes. The methods are useful, e.g., for
diagnosing patients susceptible to developing cancer.

In a fifteenth embodiment, the invention provides arrays 
of probes tiling a reference sequence from a mitochondrial 
genome. The reference sequence may comprise part or all of 
the D-loop region, or all, or substantially all, of the 
mitochondrial genome. The invention further provides methods
of using the arrays described above to analyze target 
sequences from a mitochondrial genome. The methods are useful 
for identifying mutations associated with disease, and for 
forensic, epidemiological and evolutionary studies. 

BRIEF DESCRIPTION OF THE FIGURES 

Fig. 1: Basic tiling strategy. The figure illustrates 
the relationship between an interrogation position (I) and a 
corresponding nucleotide (n) in the reference sequence, and 
between a probe from the first probe set and corresponding 
probes from second, third and fourth probe sets. 

Fig. 2: Segment of complementarity in a probe from the 
first probe set. 

Fig. 3: Incremental succession of probes in a basic 
tiling strategy. The figure shows four probe sets, each 
having three probes. Note that each probe differs from its 
predecessor in the same set by the acquisition of a 5' 
nucleotide and the loss of a 3' nucleotide, as well as in the
nucleotide occupying the interrogation position. 

Fig. 4: Exemplary arrangement of lanes on a chip. The 
chip shows four probe sets, each having five probes and each 
having a total of five interrogation positions (I1-I5), one
per probe. 

Fig. 5: Hybridization pattern of chip having probes laid 
down in lanes. Dark patches indicate hybridization. The 
probes in the lower part of the figure occur at the column of 
the array indicated by the arrow when the probe length is 15
and the interrogation position is 7.

Fig. 6: Strategies for detecting deletion and insertion 
mutations. Bases in brackets may or may not be present. 

Fig. 7: Block tiling strategy. The probe from the first
probe set has three interrogation positions. The probes from 
the other probe sets have only one of these interrogation 
positions. 

Fig. 8: Multiplex tiling strategy. Each probe has two
interrogation positions.

Fig. 9: Helper mutation strategy. The segment of
complementarity differs from the complement of the reference
sequence at a helper mutation as well as the interrogation
position.

Fig. 10: Layout of probes on the HV 407 chip. The figure 
shows successive rows of sequence, each of which is subdivided 
into four lanes. The four lanes correspond to the A-, C-, G- 
and T-lanes on the chip. Each probe is represented by the 
nucleotide occupying its interrogation position. The letter 
"N" indicates a control probe or empty column. The different 
sized probes are laid out in parallel. That is, from top to 
bottom, a row of 13 mers is followed by a row of 15 mers, 
which is followed by a row of 17 mers, which is followed by a 
row of 19 mers. 

Fig. 11: Fluorescence pattern of HV 407 hybridized to a 
target sequence (pPol19) identical to the chip's reference 
sequence. 

Fig. 12: Sequence read from HV 407 chip hybridized to 
pPol19 and 4MUT18 (separate experiments). The reference 
sequence is designated "wildtype." Beneath the reference 
sequence are four rows of sequence read from the chip 
hybridized to the pPol19 target, the first row being read from 
13 mers, the second row from 15 mers, the third row from 17 
mers and the fourth row from 19 mers. Beneath these 
sequences, there are four further rows of sequence read from 
the chip hybridized to the HXB2 target. Successive rows are 
read from 13 mers, 15 mers, 17 mers and 19 mers. Each 
nucleotide in a row is called from the relative fluorescence 
intensities of probes in A-, C-, G- and T-lanes. Regions of 
ambiguous sequence read from the chip are highlighted. The 
strain differences between the HXB2 sequence and the reference 
sequence that were correctly detected are indicated (*), and 
those that could not be called are indicated (o). (The 
nucleotide at position 417 was read correctly in some 
experiments.) The locations of some mutations known to be 
associated with drug resistance that occur in readable regions 
of the chip are shown above (codon number) and below (mutant 
nucleotide) the sequence designated "wildtype." The locations 
of primers used to amplify the target sequence are indicated by 
arrows. 

Fig. 13: Detection of mixed target sequences. The 
mutant target differs from the wildtype by a single mutation 
in codon 67 of the reverse transcriptase gene. Each different 
sized group of probes has a column of four probes for reading 
the nucleotide in which the mutation occurs. The four probes 
occupying a column are represented by a single probe in the 
figure with the symbol (o) indicating the interrogation 
position, which is occupied by a different nucleotide in each 
probe. 

Fig. 14: Fluorescence intensities of target bound to 13 
mers and 15 mers for different proportions of mutant and 
wildtype target. The fluorescence intensities are from probes 
having interrogation positions for reading the nucleotide at 
which the mutant and wildtype targets diverge. 

Fig. 15: Sequence read from protease chip from four 
clinical samples before and after treatment with ddI. 

Fig. 16: Block tiling array of probes for analyzing a 
CFTR point mutation. Each probe shown actually represents four 
probes, with one probe having each of A, C, G or T at the 
interrogation position N. In the order shown, the first probe 
shown on the left is tiled from the wildtype reference 
sequence, the second probe from the mutant sequence, and so on 
in alternating fashion. Note that all of the probes are 
identical except at the interrogation position, which shifts 
one position between successive probes tiled from the same 
reference sequence (e.g., the first, third and fifth probes in 
the left hand column). The grid shows the hybridization 
intensities when the array is hybridized to the reference 
sequence. 

Fig. 17: Hybridization pattern for heterozygous target. 
The figure shows the hybridization pattern when the array of 
the previous figure is hybridized to a mixture of mutant and 
wildtype reference sequences. 

Fig. 18, in panels A, B, and C, shows an image made from 
the region of a DNA chip containing CFTR exon 10 probes; in 
panel A, the chip was hybridized to a wild-type target; in 
panel C, the chip was hybridized to a mutant ΔF508 target; and 
in panel B, the chip was hybridized to a mixture of the 
wild-type and mutant targets. 

Fig. 19, in sheets 1-3, corresponding to panels A, B, 
and C of Fig. 18, shows graphs of fluorescence intensity 
versus tiling position. The labels on the horizontal axis 
show the bases in the wild-type sequence corresponding to the 
position of substitution in the respective probes. Plotted 
are the intensities observed from the features (or synthesis 
sites) containing wild-type probes, the features containing 
the substitution probes that bound the most target ("called"), 
and the feature containing the substitution probes that bound 
the target with the second highest intensity of all the 
substitution probes ("2nd Highest"). 

Fig. 20, in panels A, B, and C, shows an image made from 
a region of a DNA chip containing CFTR exon 10 probes; in 
panel A, the chip was hybridized to the wt480 target; in panel 
C, the chip was hybridized to the mu480 target; and in panel 
B, the chip was hybridized to a mixture of the wild-type and 
mutant targets. 

Fig. 21, in sheets 1-3, corresponding to panels A, B, 
and C of Fig. 20, shows graphs of fluorescence intensity 
versus tiling position. The labels on the horizontal axis 
show the bases in the wild-type sequence corresponding to the 
position of substitution in the respective probes. Plotted 
are the intensities observed from the features (or synthesis 
sites) containing wild-type probes, the features containing 
the substitution probes that bound the most target ("called"), 
and the feature containing the substitution probes that bound 
the target with the second highest intensity of all the 
substitution probes ("2nd Highest"). 

Fig. 22, in panels A and B, shows an image made from a 
region of a DNA chip containing CFTR exon 10 probes; in panel 
A, the chip was hybridized to nucleic acid derived from the 
genomic DNA of an individual with wild-type ΔF508 sequences; 
in panel B, the target nucleic acid originated from a 
heterozygous (with respect to the ΔF508 mutation) individual. 

Fig. 23, in sheets 1 and 2, corresponding to panels A and 
B of Fig. 22, shows graphs of fluorescence intensity versus 
tiling position. The labels on the horizontal axis show the 
bases in the wild-type sequence corresponding to the position 
of substitution in the respective probes. Plotted are the 
intensities observed from the features (or synthesis sites) 
containing wild-type probes, the features containing the 
substitution probes that bound the most target ("called"), and 
the feature containing the substitution probes that bound the 
target with the second highest intensity of all the 
substitution probes ("2nd Highest"). 

Fig. 24: Hybridization of homozygous wildtype (A) and 
heterozygous (B) target sequences from exon 11 of the CFTR 
gene to a block tiling array designed to detect G551D and 
Q552X mutations in the CFTR gene. 

Fig. 25: Hybridization of homozygous wildtype (A) and 
ΔF508 mutant (B) target sequences from exon 10 of the CFTR 
gene to a block tiling array designed to detect the mutations 
ΔF508, ΔI507 and F508C. 

Fig. 26: Hybridization of heterozygous mutant target 
sequences, ΔF508/F508C, to the array of Fig. 25. 

Fig. 27 shows the alignment of some of the probes on a 
p53 DNA chip with a 12-mer model target nucleic acid. 

Fig. 28 shows a set of 10-mer probes for a p53 exon 6 DNA 
chip. 

Fig. 29 shows that very distinct patterns are observed 
after hybridization of p53 DNA chips with targets having 
different 1 base substitutions. In the first image in Fig. 
29, the 12-mer probes that form perfect matches with the 
wild-type target are in the first row (top). The 12-mer 
probes with single base mismatches are located in the second, 
third, and fourth rows and have much lower signals. 

Fig. 30, in graphs 2, 3, and 4, graphically depicts the 
data in Fig. 29. On each graph, the X coordinate is the 
position of the probe in its row on the chip, and the Y 
coordinate is the signal at that probe site after hybridization. 

Fig. 31 shows the results of hybridizing mixed target 
populations of WT and mutant p53 genes to the p53 DNA chip. 

Fig. 32, in graphs 1-4, shows (see Fig. 30 as well) the 
hybridization efficiency of a 10-mer probe array as compared 
to a 12-mer probe array. 

Fig. 33 shows an image of a p53 DNA chip hybridized to a 
target DNA. 

Fig. 34 illustrates how the actual sequence was read from 
the chip shown in Fig. 33. Gaps in the sequence of letters in 
the WT rows correspond to control probes or sites. Positions 
at which bases are miscalled are represented by letters in 
italic type in cells corresponding to probes in which the WT 
bases have been substituted by other bases. 

Fig. 35 shows the human mitochondrial genome; "O_H" is the 
H strand origin of replication, and arrows indicate the cloned 
unshaded sequence. 

Fig. 36 shows the image observed from application of a 
sample of mitochondrial DNA derived nucleic acid (from the mt4 
sample) on a DNA chip. 

Fig. 37 is similar to Fig. 36 but shows the image 
observed from the mt5 sample. 

Fig. 38 shows the predicted difference image between the 
mt4 and mt5 samples on the DNA chip based on mismatches 
between the two samples and the reference sequence. 

Fig. 39 shows the actual difference image observed for 
the mt4 and mt5 samples. 

Fig. 40, in sheets 1 and 2, shows a plot of normalized 
intensities across rows 10 and 11 of the array and a 
tabulation of the mutations detected. 

Fig. 41 shows the discrimination between wild-type and 
mutant hybrids obtained with the chip. A median of the six 
normalized hybridization scores for each probe was taken; the 
graph plots the ratio of the median score to the normalized 
hybridization score versus mean counts. A ratio of 1.6 and 
mean counts above 50 yield no false positives. 

Fig. 42 illustrates how the identity of the base mismatch 
may influence the ability to discriminate mutant and wild-type 
sequences more than the position of the mismatch within an 
oligonucleotide probe. The mismatch position is expressed as 
% of probe length from the 3'-end. The base change is 
indicated on the graph. 

Fig. 43 provides a 5' to 3' sequence listing of one 
target corresponding to the probes on the chip. X is a 
control probe. Positions that differ in the target (i.e., are 
mismatched with the probe at the designated site) are in bold. 

Fig. 44 shows the fluorescence image produced by scanning 
the chip described in Fig. 17 when hybridized to a sample. 

Fig. 45 illustrates the detection of 4 transitions in the 
target sequence relative to the wild-type probes on the chip 
in Fig. 44. 

Fig. 46: VLSIPS™ technology applied to the light 
directed synthesis of oligonucleotides. Light (hv) is shone 
through a mask (M^^) to activate functional groups (-OH) on a 
surface by removal of a protecting group (X). Nucleoside 
building blocks protected with photoremovable protecting 
groups (T-X, C-X) are coupled to the activated areas. By 
repeating the irradiation and coupling steps, very complex 
arrays of oligonucleotides can be prepared. 

Fig. 47: Use of the VLSIPS™ process to prepare 
"nucleoside combinatorials" or oligonucleotides synthesized by 
coupling all four nucleosides to form dimers, trimers, and so 
forth. 

Fig. 48: Deprotection, coupling, and oxidation steps of 
a solid phase DNA synthesis method. 

Fig. 49: An illustrative synthesis route for the 
nucleoside building blocks used in the VLSIPS™ method. 

Fig. 50: A preferred photoremovable protecting group, 
MeNPOC, and preparation of the group in active form. 

Fig. 51: Detection system for scanning a DNA chip. 

DETAILED DESCRIPTION OF THE INVENTION 

The invention provides a number of strategies for 
comparing a polynucleotide of known sequence (a reference 
sequence) with variants of that sequence (target sequences). 
The comparison can be performed at the level of entire 
genomes, chromosomes, genes, exons or introns, or can focus on 
individual mutant sites and immediately adjacent bases. The 
strategies allow detection of variations, such as mutations or 
polymorphisms, in the target sequence irrespective of whether a 
particular variant has previously been characterized. The 
strategies both define the nature of a variant and identify 
its location in a target sequence. 

The strategies employ arrays of oligonucleotide probes 
immobilized to a solid support. Target sequences are analyzed 
by determining the extent of hybridization at particular 
probes in the array. The strategy in selection of probes 
facilitates distinction between perfectly matched probes and 
probes showing single-base or other degrees of mismatches. 
The strategy usually entails sampling each nucleotide of 
interest in a target sequence several times, thereby achieving 
a high degree of confidence in its identity. This level of 
confidence is further increased by sampling of nucleotides 
adjacent to the nucleotides of interest in the target sequence. 
The number of probes on the chip can be quite large (e.g., 
10^-10^). However, usually only a small proportion of the 
total number of possible probes of a given length are 
represented. Some advantages of the use of only a small 
proportion of all possible probes of a given length include: 
(i) each position in the array is highly informative, whether 
or not hybridization occurs; (ii) nonspecific hybridization is 
minimized; (iii) it is straightforward to correlate 
hybridization differences with sequence differences, 
particularly with reference to the hybridization pattern of a 
known standard; and (iv) the ability to address each probe 
independently during synthesis, using high resolution 
photolithography, allows the array to be designed and 
optimized for any sequence. For example, the length of any 
probe can be varied independently of the others. 

The present tiling strategies result in sequencing and 
comparison methods suitable for routine large-scale practice 
with a high degree of confidence in the sequence output. 



I. GENERAL TILING STRATEGIES 

A. Selection of Reference Sequence 

The chips are designed to contain probes exhibiting 
complementarity to one or more selected reference sequences 
whose sequence is known. The chips are used to read a target 
sequence comprising either the reference sequence itself or 
variants of that sequence. Target sequences may differ from 
the reference sequence at one or more positions but show a 
high overall degree of sequence identity with the reference 
sequence (e.g., at least 75, 90, 95, 99, 99.9 or 99.99%). Any 
polynucleotide of known sequence can be selected as a 
reference sequence. Reference sequences of interest include 
sequences known to include mutations or polymorphisms 
associated with phenotypic changes having clinical 
significance in human patients. For example, the CFTR gene 
and p53 gene in humans have been identified as the location of 
several mutations resulting in cystic fibrosis or cancer, 
respectively. Other reference sequences of interest include 
those that serve to identify pathogenic microorganisms and/or 
are the site of mutations by which such microorganisms acquire 
drug resistance (e.g., the HIV reverse transcriptase gene). 
Other reference sequences of interest include regions where 
polymorphic variations are known to occur (e.g., the D-loop 
region of mitochondrial DNA). These reference sequences have 
utility for, e.g., forensic or epidemiological studies. Other 
reference sequences of interest include p34 (related to p53), 
p65 (implicated in breast, prostate and liver cancer), and DNA 
segments encoding cytochromes P450 (see Meyer et al., Pharmac. 
Ther. 46, 349-355 (1990)). Other reference sequences of 
interest include those from the genome of pathogenic viruses 
(e.g., hepatitis (A, B, or C), herpes virus (e.g., VZV, HSV-1, 
HAV-6, HSV-II, CMV and Epstein Barr virus), adenovirus, 
influenza virus, flaviviruses, echovirus, rhinovirus, 
coxsackie virus, coronavirus, respiratory syncytial virus, 
mumps virus, rotavirus, measles virus, rubella virus, 
parvovirus, vaccinia virus, HTLV virus, dengue virus, 
papillomavirus, molluscum virus, poliovirus, rabies virus, JC 
virus and arboviral encephalitis virus). Other reference 
sequences of interest are from genomes or episomes of 
pathogenic bacteria, particularly regions that confer drug 
resistance or allow phylogenic characterization of the host 
(e.g., 16S rRNA or corresponding DNA). For example, such 
bacteria include chlamydia, rickettsial bacteria, 
mycobacteria, staphylococci, streptococci, pneumococci, 
meningococci and gonococci, Klebsiella, proteus, serratia, 
pseudomonas, legionella, diphtheria, salmonella, bacilli, 
cholera, tetanus, botulism, anthrax, plague, leptospirosis, 
and Lyme disease bacteria. Other reference sequences of 
interest include those in which mutations result in the 
following autosomal recessive disorders: sickle cell anemia, 
β-thalassemia, phenylketonuria, galactosemia, Wilson's 
disease, hemochromatosis, severe combined immunodeficiency, 
alpha-1-antitrypsin deficiency, albinism, alkaptonuria, 
lysosomal storage diseases and Ehlers-Danlos syndrome. Other 
reference sequences of interest include those in which 
mutations result in X-linked recessive disorders: hemophilia, 
glucose-6-phosphate dehydrogenase deficiency, 
agammaglobulinemia, diabetes insipidus, Lesch-Nyhan syndrome, 
muscular dystrophy, Wiskott-Aldrich syndrome, Fabry's disease 
and fragile X syndrome. Other reference sequences of interest 
include those in which mutations result in the following 
autosomal dominant disorders: familial hypercholesterolemia, 
polycystic kidney disease, Huntington's disease, hereditary 
spherocytosis, Marfan's syndrome, von Willebrand's disease, 
neurofibromatosis, tuberous sclerosis, hereditary hemorrhagic 
telangiectasia, familial colonic polyposis, Ehlers-Danlos 
syndrome, myotonic dystrophy, muscular dystrophy, osteogenesis 
imperfecta, acute intermittent porphyria, and von Hippel- 
Lindau disease. 

The length of a reference sequence can vary widely from a 
full-length genome, to an individual chromosome, episome, 
gene, component of a gene, such as an exon, intron or 
regulatory sequences, to a few nucleotides. A reference 
sequence of between about 2, 5, 10, 20, 50, 100, 500, 1000, 
5,000 or 10,000, 20,000 or 100,000 nucleotides is common. 
Sometimes only particular regions of a sequence (e.g., exons 
of a gene) are of interest. In such situations, the 
particular regions can be considered as separate reference 
sequences or can be considered as components of a single 
reference sequence, as a matter of arbitrary choice. 

A reference sequence can be any naturally occurring, 
mutant, consensus or purely hypothetical sequence of 
nucleotides, RNA or DNA. For example, sequences can be 
obtained from computer data bases, publications or can be 
determined or conceived de novo. Usually, a reference 
sequence is selected to show a high degree of sequence 
identity to envisaged target sequences. Often, particularly 
where a significant degree of divergence is anticipated 
between target sequences, more than one reference sequence is 
selected. Combinations of wildtype and mutant reference 
sequences are employed in several applications of the tiling 
strategy. 

B. Chip Design 

1. Basic Tiling Strategy 

The basic tiling strategy provides an array of 
immobilized probes for analysis of target sequences showing a 
high degree of sequence identity to one or more selected 
reference sequences. The strategy is first illustrated for an 
array that is subdivided into four probe sets, although it 
will be apparent that in some situations, satisfactory results 
are obtained from only two probe sets. A first probe set 
comprises a plurality of probes exhibiting perfect 
complementarity with a selected reference sequence. The 
perfect complementarity usually exists throughout the length 
of the probe. However, probes having a segment or segments of 
perfect complementarity that is/are flanked by leading or 
trailing sequences lacking complementarity to the reference 
sequence can also be used. Within a segment of 
complementarity, each probe in the first probe set has at 
least one interrogation position that corresponds to a 
nucleotide in the reference sequence. That is, the 
interrogation position is aligned with the corresponding 
nucleotide in the reference sequence, when the probe and 
reference sequence are aligned to maximize complementarity 
between the two. If a probe has more than one interrogation 
position, each corresponds with a respective nucleotide in the 
reference sequence. The identity of an interrogation position 
and corresponding nucleotide in a particular probe in the 
first probe set cannot be determined simply by inspection of 
the probe in the first set. As will become apparent, an 
interrogation position and corresponding nucleotide are defined 
by the comparative structures of probes in the first probe set 
and corresponding probes from additional probe sets. 

In principle, a probe could have an interrogation 
position at each position in the segment complementary to the 
reference sequence. Sometimes, interrogation positions 
provide more accurate data when located away from the ends of 
a segment of complementarity. Thus, typically a probe having 
a segment of complementarity of length x does not contain more 
than x-2 interrogation positions. Since probes are typically 
9-21 nucleotides, and usually all of a probe is complementary, 
a probe typically has 1-19 interrogation positions. Often the 
probes contain a single interrogation position, at or near the 
center of the probe. 
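
The arithmetic in the preceding paragraph can be sketched as 
follows. This is an illustrative sketch only: the function names 
are hypothetical and form no part of the disclosure. It assumes 
that the whole probe is a single segment of complementarity and 
that the two terminal positions are excluded. 

```python
def max_interrogation_positions(segment_length):
    """Upper bound of x-2 interrogation positions for a segment of length x."""
    return max(segment_length - 2, 0)


def central_position(probe_length):
    """1-based index of the central nucleotide of an odd-length probe."""
    if probe_length % 2 == 0:
        raise ValueError("an even-length probe has no exact center")
    return (probe_length + 1) // 2


# Typical probe lengths of 9-21 nucleotides.
for length in (9, 15, 21):
    print(length, max_interrogation_positions(length), central_position(length))
```

For a 21 mer the bound is 19 interrogation positions, which is 
the upper end of the 1-19 range given above. 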

For each probe in the first set, there are, for purposes 
of the present illustration, three corresponding probes from 
three additional probe sets. See Fig. 1. Thus, there are 
four probes corresponding to each nucleotide of interest in 
the reference sequence. Each of the four corresponding probes 
has an interrogation position aligned with that nucleotide of 
interest. Usually, the probes from the three additional 
probe sets are identical to the corresponding probe from the 
first probe set with one exception. The exception is that at 
least one (and often only one) interrogation position, which 
occurs in the same position in each of the four corresponding 
probes from the four probe sets, is occupied by a different 
nucleotide in the four probe sets. For example, for an A 
nucleotide in the reference sequence, the corresponding probe 
from the first probe set has its interrogation position 
occupied by a T, and the corresponding probes from the 
additional three probe sets have their respective 
interrogation positions occupied by A, C, or G, a different 
nucleotide in each probe. Of course, if a probe from the 
first probe set comprises trailing or flanking sequences 
lacking complementarity to the reference sequences (see 
Fig. 2), these sequences need not be present in corresponding 
probes from the three additional sets. Likewise, corresponding 
probes from the three additional sets can contain leading or 
trailing sequences outside the segment of complementarity that 
are not present in the corresponding probe from the first 
probe set. Occasionally, the probes from the additional three 
probe sets are identical (with the exception of interrogation 
position(s)) to a contiguous subsequence of the full 
complementary segment of the corresponding probe from the 
first probe set. In this case, the subsequence includes the 
interrogation position and usually differs from the full-length 
probe only in the omission of one or both terminal 
nucleotides from the termini of a segment of complementarity. 
That is, if a probe from the first probe set has a segment of 
complementarity of length n, corresponding probes from the 
other sets will usually include a subsequence of the segment 
of at least length n-2. Thus, the subsequence is usually at 
least 3, 4, 7, 9, 15, 21, or 25 nucleotides long, most 
typically, in the range of 9-21 nucleotides. The subsequence 
should be sufficiently long to allow a probe to hybridize 
detectably more strongly to a variant of the reference 
sequence mutated at the interrogation position than to the 
reference sequence. 
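
The relationship among the four corresponding probes can be 
sketched in code. The sketch below is illustrative only and 
rests on stated assumptions: the names are hypothetical, each 
probe consists solely of its segment of complementarity, there 
is a single interrogation position, and each probe is written 
3' to 5' so that it reads base for base beneath the reference 
sequence written 5' to 3'. 

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq):
    """Base-wise complement, written so that it aligns under the reference."""
    return "".join(COMPLEMENT[base] for base in seq)


def four_probes(reference, start, length, interrogation_index):
    """Return the first-set probe and all four corresponding probes.

    The returned dict is keyed by the nucleotide occupying the
    interrogation position; one entry is the perfectly complementary
    probe and the other three differ from it only at that position.
    """
    perfect = complement(reference[start:start + length])
    probes = {}
    for base in "ACGT":
        probes[base] = (perfect[:interrogation_index] + base
                        + perfect[interrogation_index + 1:])
    return perfect, probes


# An A in the reference is interrogated by a first-set probe carrying T.
perfect, probes = four_probes("CCCACCC", start=0, length=7, interrogation_index=3)
print(perfect)
for base, probe in probes.items():
    print(base, probe)
```

Keying the probes by the interrogating nucleotide, rather than 
by probe set, mirrors the lane arrangement in which the first 
probe set is not confined to one lane. 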

The probes can be oligodeoxyribonucleotides or 
oligoribonucleotides, or any modified forms of these polymers 
that are capable of hybridizing with a target nucleic acid 
sequence by complementary base-pairing. Complementary base 
pairing means sequence-specific base pairing which includes, 
e.g., Watson-Crick base pairing as well as other forms of base 
pairing such as Hoogsteen base pairing. Modified forms 
include 2'-O-methyl oligoribonucleotides and so-called PNAs, 
in which oligodeoxyribonucleotides are linked via peptide 
bonds rather than phosphodiester bonds. The probes can be 
attached by any linkage to a support (e.g., 3', 5' or via the 
base). 3' attachment is more usual as this orientation is 
compatible with the preferred chemistry for solid phase 
synthesis of oligonucleotides. 

The number of probes in the first probe set (and as a 
consequence the number of probes in additional probe sets) 
depends on the length of the reference sequence, the number of 
nucleotides of interest in the reference sequence and the 
number of interrogation positions per probe. In general, each 
nucleotide of interest in the reference sequence requires the 
same interrogation position in the four sets of probes. 
Consider, as an example, a reference sequence of 100 
nucleotides, 50 of which are of interest, and probes each 
having a single interrogation position. In this situation, 
the first probe set requires fifty probes, each having one 
interrogation position corresponding to a nucleotide of 
interest in the reference sequence. The second, third and 
fourth probe sets each have a corresponding probe for each 
probe in the first probe set, and so each also contains a 
total of fifty probes. The identity of each nucleotide of 
interest in the reference sequence is determined by comparing 
the relative hybridization signals at four probes having 
interrogation positions corresponding to that nucleotide from 
the four probe sets. 
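
The probe count in this example, and the comparison of the 
four signals, can be sketched as follows. The sketch is 
illustrative only. The function names and the min_ratio 
threshold are hypothetical; the text above says only that the 
relative signals are compared, so the particular rule used here 
(strongest signal wins, provided it sufficiently exceeds the 
second strongest) is an assumption. 

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def probe_counts(nucleotides_of_interest, probe_sets=4):
    """Probes per set and in total, assuming one interrogation position per probe."""
    return nucleotides_of_interest, nucleotides_of_interest * probe_sets


def call_base(intensities, min_ratio=1.2):
    """Call the target nucleotide from the four corresponding probes.

    intensities maps the nucleotide at each probe's interrogation
    position to its hybridization signal. The target base is taken
    to be the complement of the brightest probe's nucleotide; "N" is
    returned when no probe clearly dominates (assumed rule).
    """
    ranked = sorted(intensities.items(), key=lambda item: item[1], reverse=True)
    (best_base, best), (_, second) = ranked[0], ranked[1]
    if best <= 0 or (second > 0 and best / second < min_ratio):
        return "N"
    return COMPLEMENT[best_base]


print(probe_counts(50))
print(call_base({"A": 10, "C": 12, "G": 8, "T": 100}))
print(call_base({"A": 50, "C": 48, "G": 5, "T": 4}))
```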

In some reference sequences, every nucleotide is of 
interest. In other reference sequences, only certain portions 
in which variants (e.g., mutations or polymorphisms) are 
concentrated are of interest. In other reference sequences, 
only particular mutations or polymorphisms and immediately 
adjacent nucleotides are of interest. Usually, the first 
probe set has interrogation positions selected to correspond 
to at least a nucleotide (e.g., representing a point mutation) 
and one immediately adjacent nucleotide. Usually, the probes 
in the first set have interrogation positions corresponding to 
at least 3, 10, 50, 100, 1000, or 20,000 contiguous 



wo 95/11995 PCT/US94/12305 

25 

nucleotides. The probes usually have interrogation positions 
corresponding to at least 5, 10, 30, 50, 75, 90, 99 or 
sometimes 100% of the nucleotides in a reference sequence. 
Frequently, the probes in the first probe set completely span 
the reference sequence and overlap with one another relative 
to the reference sequence. For example, in one common 
arrangement each probe in the first probe set differs from 
another probe in that set by the omission of a 3' base 
complementary to the reference sequence and the acquisition of 
a 5' base complementary to the reference sequence. See 
Fig. 3. 
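
The incremental succession of probes in this arrangement can 
be sketched as a sliding window over the reference sequence. 
The sketch is illustrative only; the names are hypothetical, 
and probes are again written 3' to 5' beneath a reference 
written 5' to 3', so that the leftmost base of each probe is 
its 3' base. 

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq):
    """Base-wise complement, written so that it aligns under the reference."""
    return "".join(COMPLEMENT[base] for base in seq)


def tile_first_probe_set(reference, probe_length):
    """One perfectly complementary probe per window, stepping one base at a time."""
    windows = len(reference) - probe_length + 1
    return [complement(reference[i:i + probe_length]) for i in range(windows)]


probes = tile_first_probe_set("ACGTACGT", probe_length=5)
for offset, probe in enumerate(probes):
    # Indent each probe so that it sits under the window it complements.
    print(" " * offset + probe)
```

Each probe drops the 3' base of its predecessor and gains one 
new 5' base, so successive probes share all but one nucleotide. 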

For conceptual simplicity, the probes in a set are 
usually arranged in order of the sequence in a lane across the 
chip. A lane contains a series of overlapping probes, which 
represent, or tile across, the selected reference sequence (see 
Fig. 3). The components of the four sets of probes are 
usually laid down in four parallel lanes, collectively 
constituting a row in the horizontal direction and a series of 
4-member columns in the vertical direction. Corresponding 
probes from the four probe sets (i.e., complementary to the 
same subsequence of the reference sequence) occupy a column. 
Each probe in a lane usually differs from its predecessor in 
the lane by the omission of a base at one end and the 
inclusion of an additional base at the other end, as shown in 
Fig. 3. However, this orderly progression of probes can be 
interrupted by the inclusion of control probes or omission of 
probes in certain columns of the array. Such columns serve as 
controls to orient the chip, or gauge the background, which 
can include target sequence nonspecifically bound to the chip. 

The probe sets are usually laid down in lanes such that 
all probes having an interrogation position occupied by an A 
form an A-lane, all probes having an interrogation position 
occupied by a C form a C-lane, all probes having an 
interrogation position occupied by a G form a G-lane, and all 
probes having an interrogation position occupied by a T (or U) 
form a T-lane (or a U-lane). Note that in this arrangement 
there is not a unique correspondence between probe sets and 
lanes. Thus, the probe from the first probe set is laid down 
in the A-lane, C-lane, A-lane, A-lane and T-lane for the five 
columns in Fig. 4. The interrogation position on a column of 
probes corresponds to the position in the target sequence 
whose identity is determined from analysis of hybridization to 
the probes in that column. Thus, I1-I5 respectively 
correspond to N1-N5 in Fig. 4. The interrogation position can 
be anywhere in a probe but is usually at or near the central 
position of the probe to maximize differential hybridization 
signals between a perfect match and a single-base mismatch. 
For example, for an 11 mer probe, the central position is the 
sixth nucleotide. 
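
The lack of a fixed correspondence between probe sets and 
lanes can be sketched as follows. The sketch is illustrative 
only and the names are hypothetical. The five reference 
nucleotides T, G, T, T, A used below are an assumption, chosen 
because by complementarity they would place the first-set 
probes in the A-, C-, A-, A- and T-lanes; the actual sequence 
of Fig. 4 is not reproduced here. 

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
LANES = "ACGT"


def perfect_match_lanes(reference_bases):
    """Lane holding the first-set (perfectly matched) probe in each column."""
    return [COMPLEMENT[base] for base in reference_bases]


def lane_grid(reference_bases):
    """For each lane, mark columns holding a first-set probe with '*'."""
    matched = perfect_match_lanes(reference_bases)
    return {lane: "".join("*" if m == lane else "." for m in matched)
            for lane in LANES}


# Hypothetical nucleotides of interest for five columns.
for lane, row in lane_grid("TGTTA").items():
    print(lane + "-lane", row)
```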

Although the array of probes is usually laid down in rows 
and columns as described above, such a physical arrangement of 
probes on the chip is not essential. Provided that the 
spatial location of each probe in an array is known, the data 
from the probes can be collected and processed to yield the 
sequence of a target irrespective of the physical arrangement 
of the probes on a chip. In processing the data, the 
hybridization signals from the respective probes can be 
reassorted into any conceptual array desired for subsequent 
data reduction whatever the physical arrangement of probes on 
the chip. 
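
Reassortment of signals from an arbitrary physical layout 
into a conceptual array of lanes and columns can be sketched as 
follows. The sketch is illustrative only; the data structures 
and names are hypothetical, and the only requirement taken from 
the text is that the location of each probe be known. 

```python
def reassort(signals, layout, n_columns, lanes="ACGT"):
    """Rebuild a conceptual lane-by-column array from scattered features.

    signals maps a physical location (x, y) to a hybridization signal.
    layout maps the same location to its conceptual (lane, column).
    Cells with no probe (e.g., omitted or control columns) stay None.
    """
    array = {lane: [None] * n_columns for lane in lanes}
    for location, (lane, column) in layout.items():
        array[lane][column] = signals[location]
    return array


# Hypothetical chip on which probes were placed in scrambled order.
layout = {(0, 0): ("T", 1), (1, 0): ("A", 0), (0, 1): ("C", 0),
          (1, 1): ("G", 1), (2, 0): ("A", 1), (2, 1): ("T", 0)}
signals = {(0, 0): 95, (1, 0): 7, (0, 1): 88, (1, 1): 4, (2, 0): 6, (2, 1): 9}

for lane, row in reassort(signals, layout, n_columns=2).items():
    print(lane, row)
```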

A range of lengths of probes can be employed in the 
chips. As noted above, a probe may consist exclusively of a
complementary segment, or may have one or more complementary
segments juxtaposed by flanking, trailing and/or intervening
segments. In the latter situation, the total length of
complementary segment(s) is more important than the length of
the probe. In functional terms, the complementarity
segment(s) of the first probe set should be sufficiently long
to allow the probe to hybridize detectably more strongly to a 
reference sequence compared with a variant of the reference 
including a single base mutation at the nucleotide 
corresponding to the interrogation position of the probe.
Similarly, the complementarity segment(s) in corresponding
probes from additional probe sets should be sufficiently long
to allow a probe to hybridize detectably more strongly to a 
variant of the reference sequence having a single nucleotide
substitution at the interrogation position relative to the
reference sequence. A probe usually has a single
complementary segment having a length of at least
3 nucleotides, and more usually at least 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25 or
30 bases exhibiting perfect complementarity (other than
possibly at the interrogation position(s) depending on the
probe set) to the reference sequence. In bridging strategies,
where more than one segment of complementarity is present,
each segment provides at least three complementary nucleotides
to the reference sequence and the combined segments provide at 
least two segments of three or a total of six complementary 
nucleotides. As in the other strategies, the combined length 
of complementary segments is typically from 6-30 nucleotides,
and preferably from about 9-21 nucleotides. The two segments
are often approximately the same length. Often, the probes 
(or segment of complementarity within probes) have an odd 
number of bases, so that an interrogation position can occur 
in the exact center of the probe. 

In some chips, all probes are the same length. Other
chips employ different groups of probe sets, in which case the
probes are of the same size within a group, but differ between 
different groups. For example, some chips have one group 
comprising four sets of probes as described above in which all
the probes are 11 mers, together with a second group
comprising four sets of probes in which all of the probes are
13 mers. Of course, additional groups of probes can be added. 
Thus, some chips contain, e.g., four groups of probes having 
sizes of 11 mers, 13 mers, 15 mers and 17 mers. Other chips
have different size probes within the same group of four probe
sets. In these chips, the probes in the first set can vary in 
length independently of each other. Probes in the other sets 
are usually the same length as the probe occupying the same 
column from the first set. However, occasionally different
lengths of probes can be included at the same column position
in the four lanes. The different length probes are included 
to equalize hybridization signals from probes irrespective of
whether A-T or C-G bonds are formed at the interrogation
position. 

The length of probe can be important in distinguishing 
between a perfectly matched probe and probes showing a
single-base mismatch with the target sequence. The discrimination is
usually greater for short probes. Shorter probes are usually
also less susceptible to formation of secondary structures.
However, the absolute amount of target sequence bound, and
hence the signal, is greater for larger probes. The probe
length representing the optimum compromise between these
competing considerations may vary depending on inter alia the
GC content of a particular region of the target DNA sequence, 
secondary structure, synthesis efficiency and cross-hybridization.
In some regions of the target, depending on
hybridization conditions, short probes (e.g., 11 mers) may
provide information that is inaccessible from longer probes 
(e.g., 19 mers) and vice versa. Maximum sequence information 
can be read by including several groups of different sized 
probes on the chip as noted above. However, for many regions
of the target sequence, such a strategy provides redundant
information in that the same sequence is read multiple times
from the different groups of probes. Equivalent information 
can be obtained from a single group of different sized probes 
in which the sizes are selected to maximize readable sequence
at particular regions of the target sequence. The appropriate
size of probes at different regions of the target sequence can
be determined from, e.g., Fig. 12, which compares the
readability of different sized probes in different regions of
a target. The strategy of customizing probe length within a
single group of probe sets minimizes the total number of
probes required to read a particular target sequence. This
leaves ample capacity for the chip to include probes to other 
reference sequences. 

The invention provides an optimization block which allows
systematic variation of probe length and interrogation
position to optimize the selection of probes for analyzing a
particular nucleotide in a reference sequence. The block
comprises alternating columns of probes complementary to the
wildtype target and probes complementary to a specific
mutation. The interrogation position is varied between 
columns and probe length is varied down a column. 
Hybridization of the chip to the reference sequence or the 
mutant form of the reference sequence identifies the probe 
length and interrogation position providing the greatest 
differential hybridization signal. 

The probes are designed to be complementary to either 
strand of the reference sequence (e.g., coding or non-coding). 
Some chips contain separate groups of probes, one 
complementary to the coding strand, the other complementary to 
the noncoding strand. Independent analysis of coding and 
noncoding strands provides largely redundant information. 
However, the regions of ambiguity in reading the coding strand 
are not always the same as those in reading the noncoding 
strand. Thus, combination of the information from coding and 
noncoding strands increases the overall accuracy of 
sequencing. 

Some chips contain additional probes or groups of probes 
designed to be complementary to a second reference sequence. 
The second reference sequence is often a subsequence of the 
first reference sequence bearing one or more commonly 
occurring mutations or interstrain variations. The second 
group of probes is designed by the same principles as 
described above except that the probes exhibit complementarity 
to the second reference sequence. The inclusion of a second 
group is particularly useful for analyzing short subsequences of
the primary reference sequence in which multiple mutations are 
expected to occur within a short distance commensurate with 
the length of the probes (i.e., two or more mutations within 9 
to 21 bases). Of course, the same principle can be extended 
to provide chips containing groups of probes for any number of
reference sequences. Alternatively, the chips may contain
additional probe(s) that do not form part of a tiled array as
noted above, but rather serve as probe(s) for a conventional
reverse dot blot. For example, the presence of a mutation can
be detected from binding of a target sequence to a single 
oligomeric probe harboring the mutation. Preferably, an
additional probe containing the equivalent region of the
wildtype sequence is included as a control. 

The chips are read by comparing the intensities of 
labelled target bound to the probes in an array. 
Specifically, a comparison is performed between each lane of 
probes (e.g., A, C, G and T lanes) at each columnar position 
(physical or conceptual). For a particular columnar position,
the lane showing the greatest hybridization signal is called 
as the nucleotide present at the position in the target 
sequence corresponding to the interrogation position in the 
probes. See Fig. 5. The corresponding position in the target 
sequence is that aligned with the interrogation position in 
corresponding probes when the probes and target are aligned to 
maximize complementarity. Of the four probes in a column, 
only one can exhibit a perfect match to the target sequence 
whereas the others usually exhibit at least a one base pair 
mismatch. The probe exhibiting a perfect match usually 
produces a substantially greater hybridization signal than the 
other three probes in the column and is thereby easily 
identified. However, in some regions of the target sequence, 
the distinction between a perfect match and a one-base 
mismatch is less clear. Thus, a call ratio is established to 
define the ratio of signal from the best hybridizing probe to
the second best hybridizing probe that must be exceeded for a
particular target position to be read from the probes. A high
call ratio ensures that few if any errors are made in calling
target nucleotides, but can result in some nucleotides being
scored as ambiguous, which could in fact be accurately read.
A lower call ratio results in fewer ambiguous calls, but can
result in more erroneous calls. It has been found that at a
call ratio of 1.2 virtually all calls are accurate. However,
a small but significant number of bases (e.g., up to about 
10%) may have to be scored as ambiguous. 
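
The call-ratio rule described above can be illustrated with a short Python sketch. The function name `call_base`, the use of `"N"` for an ambiguous call, and the handling of zero signals are assumptions made for illustration; the default threshold of 1.2 follows the call ratio quoted above. The function returns the lane label, and how a lane label maps to the target nucleotide depends on the lane-naming convention in use.

```python
def call_base(intensities, call_ratio=1.2):
    """Call one columnar position from four lane intensities.

    intensities: dict mapping lane label ("A", "C", "G", "T") to signal.
    Returns the lane with the greatest signal, or "N" (ambiguous) when the
    best-to-second-best ratio does not exceed call_ratio.
    """
    ranked = sorted(intensities.items(), key=lambda item: item[1], reverse=True)
    (best_lane, best), (_, second) = ranked[0], ranked[1]
    if best <= 0:
        return "N"          # no usable signal in this column (assumed convention)
    if second <= 0:
        return best_lane    # only one lane shows any signal
    return best_lane if best / second > call_ratio else "N"
```

Raising `call_ratio` in this sketch trades more ambiguous calls for fewer erroneous ones, mirroring the trade-off described in the text.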

Although small regions of the target sequence can 
sometimes be ambiguous, these regions usually occur at the 
same or similar segments in different target sequences. Thus, 
for precharacterized mutations, it is known in advance whether
that mutation is likely to occur within a region of
unambiguously determinable sequence. 

An array of probes is most useful for analyzing the 
reference sequence from which the probes were designed and
variants of that sequence exhibiting substantial sequence
similarity with the reference sequence (e.g., several
single-base mutants spaced over the reference sequence). When an
array is used to analyze the exact reference sequence from 
which it was designed, one probe exhibits a perfect match to
the reference sequence, and the other three probes in the same
column exhibit single-base mismatches. Thus, discrimination
between hybridization signals is usually high and accurate 
sequence is obtained. High accuracy is also obtained when an 
array is used for analyzing a target sequence comprising a
variant of the reference sequence that has a single mutation
relative to the reference sequence, or several widely spaced 
mutations relative to the reference sequence. At different 
mutant loci, one probe exhibits a perfect match to the target, 
and the other three probes occupying the same column exhibit
single-base mismatches, the difference (with respect to
analysis of the reference sequence) being the lane in which
the perfect match occurs. 

For target sequences showing a high degree of divergence 
from the reference strain or incorporating several closely
spaced mutations from the reference strain, a single group of
probes (i.e., designed with respect to a single reference 
sequence) will not always provide accurate sequence for the 
highly variant region of this sequence. At some particular 
columnar positions, it may be that no single probe exhibits
perfect complementarity to the target and that any comparison
must be based on different degrees of mismatch between the 
four probes. Such a comparison does not always allow the 
target nucleotide corresponding to that columnar position to 
be called. Deletions in target sequences can be detected by
loss of signal from probes having interrogation positions
encompassed by the deletion. However, signal may also be lost
from probes having interrogation positions closely proximal to
the deletion, resulting in some regions of the target sequence
that cannot be read. Target sequences bearing insertions will
also exhibit short regions including and proximal to the 
insertion that usually cannot be read. 

The presence of short regions of difficult-to-read target
because of closely spaced mutations, insertions or deletions
does not prevent determination of the remaining sequence of 
the target as different regions of a target sequence are 
determined independently. Moreover, such ambiguities as might 
result from analysis of diverse variants with a single group
of probes can be avoided by including multiple groups of probe
sets on a chip. For example, one group of probes can be
designed based on a full-length reference sequence, and the 
other groups on subsequences of the reference sequence 
incorporating frequently occurring mutations or strain
variations.

A particular advantage of the present sequencing strategy 
over conventional sequencing methods is the capacity 
simultaneously to detect and quantify proportions of multiple 
target sequences. Such capacity is valuable, e.g., for
diagnosis of patients who are heterozygous with respect to a
gene or who are infected with a virus, such as HIV, which is 
usually present in several polymorphic forms. Such capacity 
is also useful in analyzing targets from biopsies of tumor 
cells and surrounding tissues. The presence of multiple
target sequences is detected from the relative signals of the
four probes at the array columns corresponding to the target 
nucleotides at which diversity occurs. The relative signals 
at the four probes for the mixture under test are compared 
with the corresponding signals from a homogeneous reference
sequence. An increase in a signal from a probe that is
mismatched with respect to the reference sequence, and a 
corresponding decrease in the signal from the probe which is 
matched with the reference sequence, signal the presence of a
mutant strain in the mixture. The extent of the shift in
hybridization signals of the probes is related to the
proportion of a target sequence in the mixture. Shifts in
relative hybridization signals can be quantitatively related 
to proportions of reference and mutant sequence by prior
calibration of the chip with seeded mixtures of the mutant and
reference sequences. By this means, a chip can be used to 
detect variant or mutant strains constituting as little as 1,
5, 20, or 25% of a mixture of strains.

Similar principles allow the simultaneous analysis of
multiple target sequences even when none is identical to the
reference sequence. For example, with a mixture of two target 
sequences bearing first and second mutations, there would be a 
variation in the hybridization patterns of probes having
interrogation positions corresponding to the first and second
mutations relative to the hybridization pattern with the 
reference sequence. At each position, one of the probes 
having a mismatched interrogation position relative to the 
reference sequence would show an increase in hybridization
signal, and the probe having a matched interrogation position
relative to the reference sequence would show a decrease in 
hybridization signal. Analysis of the hybridization pattern 
of the mixture of mutant target sequences, preferably in 
comparison with the hybridization pattern of the reference
sequence, indicates the presence of two mutant target
sequences, the position and nature of the mutation in each
strain, and the relative proportions of each strain. 
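
One way to relate a shift in relative hybridization signal to the proportion of a mutant strain, given a prior calibration with seeded mixtures, is simple linear interpolation. The following Python sketch is hypothetical: the function name `estimate_mutant_fraction`, the use of the mutant share of the combined signal as the measured quantity, and the assumption that the calibration curve increases monotonically are all illustrative choices, not details from the specification.

```python
def estimate_mutant_fraction(ref_signal, mut_signal, calibration):
    """Estimate the mutant proportion in a mixture from two probe signals.

    calibration: list of (known_mutant_fraction, observed_signal_share) pairs
    measured beforehand on seeded mixtures (assumed monotonically increasing).
    """
    total = ref_signal + mut_signal
    if total <= 0:
        raise ValueError("no hybridization signal to analyze")
    observed = mut_signal / total  # share of signal on the mutant-matched probe
    points = sorted(calibration, key=lambda point: point[1])
    # Clamp to the calibrated range rather than extrapolating.
    if observed <= points[0][1]:
        return points[0][0]
    if observed >= points[-1][1]:
        return points[-1][0]
    for (f0, s0), (f1, s1) in zip(points, points[1:]):
        if s0 <= observed <= s1:
            if s1 == s0:
                return f0
            return f0 + (f1 - f0) * (observed - s0) / (s1 - s0)
    return points[-1][0]
```

Clamping at the ends of the calibrated range is a conservative design choice in this sketch; values outside the range are reported as the nearest calibrated proportion.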

In a variation of the above method, the different 
components in a mixture of target sequences are differentially
labelled before being applied to the array. For example, a
variety of fluorescent labels emitting at different wavelengths
are available. The use of differential labels allows 
independent analysis of different targets bound simultaneously 
to the array. For example, the methods permit comparison of
target sequences obtained from a patient at different stages
of a disease. 

2. Omission of Probes

The general strategy outlined above employs four probes
to read each nucleotide of interest in a target sequence. One
probe (from the first probe set) shows a perfect match to the
reference sequence and the other three probes (from the
second, third and fourth probe sets) exhibit a mismatch with
the reference sequence and a perfect match with a target
sequence bearing a mutation at the nucleotide of interest. 
The provision of three probes from the second, third and 
fourth probe sets allows detection of each of the three
possible nucleotide substitutions of any nucleotide of
interest. However, in some reference sequences or regions of
reference sequences, it is known in advance that only certain 
mutations are likely to occur. Thus, for example, at one site 
it might be known that an A nucleotide in the reference
sequence may exist as a T mutant in some target sequences but
is unlikely to exist as a C or G mutant. Accordingly, for 
analysis of this region of the reference sequence, one might 
include only the first and second probe sets, the first probe 
set exhibiting perfect complementarity to the reference
sequence, and the second probe set having an interrogation
position occupied by an invariant A residue (for detecting the
T mutant). In other situations, one might include the first,
second and third probe sets (but not the fourth) for
detection of a wildtype nucleotide in the reference sequence
and two mutant variants thereof in target sequences. In some
chips, probes that would detect silent mutations (i.e., not 
affecting amino acid sequence) are omitted. 

In some chips, the probes from the first probe set are 
omitted corresponding to some or all positions of the
reference sequences. Such chips comprise at least two probe
sets. The first probe set has a plurality of probes. Each 
probe comprises a segment exactly complementary to a 
subsequence of a reference sequence except in at least one 
interrogation position. A second probe set has a
corresponding probe for each probe in the first probe set.
The corresponding probe in the second probe set is identical
to a sequence comprising the corresponding probe from the
first probe set or a subsequence thereof that includes the at
least one (and usually only one) interrogation position except
that the at least one interrogation position is occupied by a
different nucleotide in each of the two corresponding probes
from the first and second probe sets. A third probe set, if
present, also comprises a corresponding probe for each probe
in the first probe set except at the at least one
interrogation position, which differs in the corresponding 
probes from the three sets. Omission of probes having a 
segment exhibiting perfect complementarity to the reference 
sequence results in loss of control information, i.e., the 
detection of nucleotides in a target sequence that are the 
same as those in a reference sequence. However, similar 
information can be obtained by hybridizing a chip lacking 
probes from the first probe set to both target and reference 
sequences. The hybridization can be performed sequentially, 
or concurrently, if the target and reference are 
differentially labelled. In this situation, the presence of a 
mutation is detected by a shift in the background 
hybridization intensity of the reference sequence to a 
perfectly matched hybridization signal of the target sequence, 
rather than by a comparison of the hybridization intensities 
of probes from the first set with corresponding probes from 
the second, third and fourth sets. 

3. Wildtype Probe Lane 

When the chips comprise four probe sets, as discussed 
supra, and the probe sets are laid down in four lanes, an A 
lane, a C-lane, a G lane and a T or U lane, the probe having a 
segment exhibiting perfect complementarity to a reference 
sequence varies between the four lanes from one column to 
another. This does not present any significant difficulty in 
computer analysis of the data from the chip. However, visual 
inspection of the hybridization pattern of the chip is 
sometimes facilitated by provision of an extra lane of probes, 
in which each probe has a segment exhibiting perfect 
complementarity to the reference sequence. See Fig. 4. This 
segment is identical to a segment from one of the probes in
the other four lanes (which lane depending on the column
position). The extra lane of probes (designated the wildtype
lane) hybridizes to a target sequence at all nucleotide 
positions except those in which deviations from the reference
sequence occur. The hybridization pattern of the wildtype
lane thereby provides a simple visual indication of mutations.

4. Deletion, Insertion and Multiple-Mutation Probes

Some chips provide an additional probe set specifically
designed for analyzing deletion mutations. The additional 
probe set comprises a probe corresponding to each probe in the
first probe set as described above. However, a probe from the
additional probe set differs from the corresponding probe in 
the first probe set in that the nucleotide occupying the 
interrogation position is deleted in the probe from the 
additional probe set. See Fig. 6. Optionally, the probe from
the additional probe set bears an additional nucleotide at one
of its termini relative to the corresponding probe from the 
first probe set. The probe from the additional probe set will 
hybridize more strongly than the corresponding probe from the 
first probe set to a target sequence having a single base
deletion at the nucleotide corresponding to the interrogation
position. Additional probe sets are provided in which not
only the interrogation position, but also an adjacent
nucleotide is deleted.

Similarly, other chips provide additional probe sets for
analyzing insertions. For example, one additional probe set
has a probe corresponding to each probe in the first probe set
as described above. However, the probe in the additional 
probe set has an extra T nucleotide inserted adjacent to the 
interrogation position. See Fig. 6. Optionally, the probe
has one fewer nucleotide at one of its termini relative to the
corresponding probe from the first probe set. The probe from 
the additional probe set hybridizes more strongly than the 
corresponding probe from the first probe set to a target 
sequence having an A nucleotide inserted in a position
adjacent to that corresponding to the interrogation position.
Similar additional probe sets are constructed having C, G or 
T/U nucleotides inserted adjacent to the interrogation 
position. Usually, four such probe sets, one for each 
nucleotide, are used in combination. 
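
The construction of deletion and insertion probes from a first-set probe can be sketched as follows. The helper names `deletion_probe` and `insertion_probes` are hypothetical, and placing the inserted base immediately after the interrogation position is an assumption (the text says only "adjacent"). The optional adjustment of one nucleotide at a probe terminus is omitted here.

```python
def deletion_probe(probe, interrogation_index):
    """Return the probe with the base at the interrogation position removed."""
    return probe[:interrogation_index] + probe[interrogation_index + 1:]

def insertion_probes(probe, interrogation_index):
    """Return four probes, one per inserted base (A, C, G, T).

    Assumption: the extra base is inserted immediately after the
    interrogation position.
    """
    head = probe[:interrogation_index + 1]
    tail = probe[interrogation_index + 1:]
    return {base: head + base + tail for base in "ACGT"}
```

For an 11 mer with its interrogation position at index 5, this sketch yields one 10 mer deletion probe and four 12 mer insertion probes, the four being the probe sets that the text says are usually used in combination.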

Other chips provide additional probes (multiple-mutation
probes) for analyzing target sequences having multiple closely
spaced mutations. A multiple-mutation probe is usually
identical to a corresponding probe from the first set as
described above, except in the base occupying the
interrogation position, and except at one or more additional 
positions, corresponding to nucleotides in which substitution 
may occur in the reference sequence. The one or more
additional positions in the multiple mutation probe are
occupied by nucleotides complementary to the nucleotides
occupying corresponding positions in the reference sequence
when the possible substitutions have occurred.

5. Block Tiling

As noted in the discussion of the general tiling
strategy, a probe in the first probe set sometimes has more
than one interrogation position. In this situation, a probe 
in the first probe set is sometimes matched with multiple 
groups of at least one, and usually, three additional probe
sets. See Fig. 7. Three additional probe sets are used to
allow detection of the three possible nucleotide substitutions
at any one position. If only certain types of substitution 
are likely to occur (e.g., transitions), only one or two 
additional probe sets are required (analogous to the use of
probes in the basic tiling strategy). To illustrate for the
situation where a group comprises three additional probe sets,
a first such group comprises second, third and fourth probe
sets, each of which has a probe corresponding to each probe in
the first probe set. The corresponding probes from the
second, third and fourth probe sets differ from the
corresponding probe in the first set at a first of the
interrogation positions. Thus, the relative hybridization 
signals from corresponding probes from the first, second, 
third and fourth probe sets indicate the identity of the
nucleotide in a target sequence corresponding to the first
interrogation position. In a second group of three probe sets
(designated fifth, sixth and seventh probe sets), each set also
has a probe corresponding to each probe in the first probe
set. These corresponding probes differ from that in the first
probe set at a second interrogation position. The relative
hybridization signals from corresponding probes from the 
first, fifth, sixth, and seventh probe sets indicate the 
identity of the nucleotide in the target sequence
corresponding to the second interrogation position. As noted
above, the probes in the first probe set often have seven or
more interrogation positions. If there are seven
interrogation positions, there are seven groups of three
additional probe sets, each group of three probe sets serving
to identify the nucleotide corresponding to one of the seven 
interrogation positions. 

Each block of probes allows short regions of a target 
sequence to be read. For example, for a block of probes
having seven interrogation positions, seven nucleotides in the
target sequence can be read. Of course, a chip can contain 
any number of blocks depending on how many nucleotides of the 
target are of interest. The hybridization signals for each 
block can be analyzed independently of any other block. The
block tiling strategy can also be combined with other tiling
strategies, with different parts of the same reference 
sequence being tiled by different strategies. 

The block tiling strategy offers three advantages over the
basic strategy in which each probe in the first set has a
single interrogation position. One advantage is that the same
sequence information can be obtained from fewer probes. A 
second advantage is that each of the probes constituting a 
block (i.e., a probe from the first probe set and a 
corresponding probe from each of the other probe sets) can
have identical 3' and 5' sequences, with the variation
confined to a central segment containing the interrogation
positions. The identity of 3' sequence between different
probes simplifies the strategy for solid phase synthesis of
the probes on the chip and results in more uniform deposition
of the different probes on the chip, thereby in turn
increasing the uniformity of signal to noise ratio for
different regions of the chip. A third advantage is that
greater signal uniformity is achieved within a block.

6. Multiplex Tiling

In the block tiling strategy discussed above, the 
identity of a nucleotide in a target or reference sequence is 
determined by comparison of hybridization patterns of one
probe having a segment showing a perfect match with that of
other probes (usually three other probes) showing a single
base mismatch. In multiplex tiling, the identity of at least
two nucleotides in a reference or target sequence is
determined by comparison of hybridization signal intensities
of four probes, two of which have a segment showing perfect
complementarity or a single base mismatch to the reference
sequence, and two of which have a segment showing perfect
complementarity or a double-base mismatch to a segment. The
four probes whose hybridization patterns are to be compared
each have a segment that is exactly complementary to a 
reference sequence except at two interrogation positions, in 
which the segment may or may not be complementary to the 
reference sequence. The interrogation positions correspond to
the nucleotides in a reference or target sequence which are
determined by the comparison of intensities. The nucleotides
occupying the interrogation positions in the four probes are 
selected according to the following rule. The first 
interrogation position is occupied by a different nucleotide
in each of the four probes. The second interrogation position
is also occupied by a different nucleotide in each of the four 
probes. In two of the four probes, designated the first and 
second probes, the segment is exactly complementary to the 
reference sequence except at not more than one of the two
interrogation positions. In other words, one of the
interrogation positions is occupied by a nucleotide that is
complementary to the corresponding nucleotide from the 
reference sequence and the other interrogation position may or 
may not be so occupied. In the other two of the four probes,
designated the third and fourth probes, the segment is exactly
complementary to the reference sequence except that both
interrogation positions are occupied by nucleotides which are
noncomplementary to the respective corresponding nucleotides
in the reference sequence.

There are a number of ways of satisfying these conditions
depending on whether the two nucleotides in the reference
sequence corresponding to the two interrogation positions are 
the same or different. If these two nucleotides are different
in the reference sequence (probability 3/4), the conditions
are satisfied by each of the two interrogation positions being
occupied by the same nucleotide in any given probe. For
example, in the first probe, the two interrogation positions
would both be A, in the second probe, both would be C, in the
third probe, each would be G, and in the fourth probe each
would be T or U. If the two nucleotides in the reference
sequence corresponding to the two interrogation positions are
the same, the conditions noted above are satisfied by each of
the interrogation positions in any one of the four probes
being occupied by complementary nucleotides. For example, in
the first probe, the interrogation positions could be occupied
by A and T, in the second probe by C and G, in the third probe
by G and C, and in the fourth probe, by T and A. See Fig. 8.
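
The selection rule for the two interrogation positions can be checked with a small Python sketch. The function names `multiplex_pairs` and `mismatch_counts` are hypothetical. The sketch uses identical bases at both positions when the two reference nucleotides differ, and complementary base pairs when the two reference nucleotides are the same, and then counts mismatches of each probe against the reference.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def multiplex_pairs(ref1, ref2):
    """Return the bases at the two interrogation positions for four probes."""
    if ref1 != ref2:
        # Different reference nucleotides: same base at both positions.
        return [(base, base) for base in "ACGT"]
    # Identical reference nucleotides: complementary bases at the two positions.
    return [(base, COMPLEMENT[base]) for base in "ACGT"]

def mismatch_counts(pairs, ref1, ref2):
    """Count mismatches (0, 1 or 2) of each probe against the reference."""
    matched = (COMPLEMENT[ref1], COMPLEMENT[ref2])  # bases that pair with the reference
    return [int(p[0] != matched[0]) + int(p[1] != matched[1]) for p in pairs]
```

For every combination of reference nucleotides, this sketch gives two probes with a single mismatch and two probes with a double mismatch against the reference, while each interrogation position is occupied by a different base in each of the four probes.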

When the four probes are hybridized to a target that is
the same as the reference sequence or differs from the
reference sequence at one (but not both) of the interrogation 
positions, two of the four probes show a double-mismatch with 
the target and two probes show a single mismatch. The
identity of probes showing these different degrees of mismatch
can be determined from the different hybridization signals. 
From the identity of the probes showing the different degrees 
of mismatch, the nucleotides occupying both of the 
interrogation positions in the target sequence can be deduced. 

For ease of illustration, the multiplex strategy has been
initially described for the situation where there are two
nucleotides of interest in a reference sequence and only four
probes in an array. Of course, the strategy can be extended
to analyze any number of nucleotides in a target sequence by
using additional probes. In one variation, each pair of
interrogation positions is read from a unique group of four
probes. In a block variation, different groups of four probes 
exhibit the same segment of complementarity with the reference 
sequence, but the interrogation positions move within a block. 

35 The block and standard multiplex tiling variants can of course 
be used in combination for different regions of a reference 
sequence. Either or both variants can also be used in 
combination with any of the other tiling strategies described. 

7. Helper Mutations 

Occasionally small regions of a reference sequence give a 
low hybridization signal as a result of self-annealing of probes. 
The self-annealing reduces the amount of probe effectively 
available for hybridizing to the target. Although such 
regions of the target are generally small and the reduction of 
hybridization signal is usually not so substantial as to 
obscure the sequence of this region, this concern can be 
avoided by the use of probes incorporating helper mutations. 
The helper mutation(s) serve to break up regions of internal 
complementarity within a probe and thereby prevent annealing. 
Usually, one or two helper mutations are quite sufficient for 
this purpose. The inclusion of helper mutations can be 
beneficial in any of the tiling strategies noted above. In 
general each probe having a particular interrogation position 
has the same helper mutation(s). Thus, such probes have a 
segment in common which shows perfect complementarity with a 
reference sequence, except that the segment contains at least 
one helper mutation (the same in each of the probes) and at 
least one interrogation position (different in all of the 
probes). For example, in the basic tiling strategy, a probe 
from the first probe set comprises a segment containing an 
interrogation position and showing perfect complementarity 
with a reference sequence except for one or two helper 
mutations. The corresponding probes from the second, third 
and fourth probe sets usually comprise the same segment (or 
sometimes a subsequence thereof including the helper 
mutation(s) and interrogation position), except that the base 
occupying the interrogation position varies in each probe. 
See Fig. 9. 

Usually, the helper mutation tiling strategy is used in 
conjunction with one of the tiling strategies described above. 
The probes containing helper mutations are used to tile 
regions of a reference sequence otherwise giving low 
hybridization signal (e.g., because of self-complementarity), 
and the alternative tiling strategy is used to tile 
intervening regions. 

8. Pooling Strategies 

Pooling strategies also employ arrays of immobilized 
probes. Probes are immobilized in cells of an array, and the 
hybridization signal of each cell can be determined 
independently of any other cell. A particular cell may be 
occupied by a pooled mixture of probes. Although the identity 
of each probe in the mixture is known, the individual probes 
in the pool are not separately addressable. Thus, the 
hybridization signal from a cell is the aggregate of that of 
the different probes occupying the cell. In general, a cell 
is scored as hybridizing to a target sequence if at least one 
probe occupying the cell comprises a segment exhibiting 
perfect complementarity to the target sequence. 

A simple strategy to show the increased power of pooled 
strategies over a standard tiling is to create three cells 
each containing a pooled probe having a single pooled 
position, the pooled position being the same in each of the 
pooled probes. At the pooled position, there are two possible 
nucleotides, allowing the pooled probe to hybridize to two 
target sequences. In tiling terminology, the pooled position 
of each probe is an interrogation position. As will become 
apparent, comparison of the hybridization intensities of the 
pooled probes from the three cells reveals the identity of the 
nucleotide in the target sequence corresponding to the 
interrogation position (i.e., that is matched with the 
interrogation position when the target sequence and pooled 
probes are maximally aligned for complementarity). 

The three cells are assigned probe pools that are 
perfectly complementary to the target except at the pooled 
position, which is occupied by a different pooled nucleotide 
in each probe as follows: 

[AC] = M, [GT] = K, [AG] = R as substitutions in the probe 
(IUPAC standard ambiguity notation) 

                 X             - interrogation position 
Target:     TAACCACTCACGGGAGCA 
Pool 1:     ATTGGMGAGTGCCC 
          = ATTGGaGAGTGCCC (complement to mutant 't') 
          + ATTGGcGAGTGCCC (complement to mutant 'g') 
Pool 2:     ATTGGKGAGTGCCC 
          = ATTGGgGAGTGCCC (complement to mutant 'c') 
          + ATTGGtGAGTGCCC (complement to wild type 'a') 
Pool 3:     ATTGGRGAGTGCCC 
          = ATTGGaGAGTGCCC (complement to mutant 't') 
          + ATTGGgGAGTGCCC (complement to mutant 'c') 

With 3 pooled probes, all 4 possible single base pair states 
(wild and 3 mutants) are detected. A pool hybridizes with a 
target if some probe contained within that pool is 
complementary to that target. 

Hybridization? 
                              Pool:  1  2  3 
Target:  TAACCACTCACGGGAGCA          n  y  n 
Mutant:  TAACCcCTCACGGGAGCA          n  y  y 
Mutant:  TAACCgCTCACGGGAGCA          y  n  n 
Mutant:  TAACCtCTCACGGGAGCA          y  n  y 

A cell containing a pair (or more) of oligonucleotides 
lights up when a target complementary to any of the 
oligonucleotides in the cell is present. Using the simple 
strategy, each of the four possible targets (wild and three 
mutants) yields a unique hybridization pattern among the three 
cells. 

Since a different pattern of hybridizing pools is 
obtained for each possible nucleotide in the target sequence 
corresponding to the pooled interrogation position in the 
probes, the identity of the nucleotide can be determined from 
the hybridization pattern of the pools. Whereas a standard 
tiling requires four cells to detect and identify the possible 
single-base substitutions at one location, this simple pooled 
strategy only requires three cells. 
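
The decoding step of the simple pooled strategy can be sketched as 
follows. This is a minimal Python illustration, not a prescribed 
implementation; the names POOLS, pattern and identify are hypothetical, 
and only the pooled interrogation position is modelled. 

```python
# Pooled base at the interrogation position of cells 1, 2 and 3.
IUPAC = {"M": "AC", "K": "GT", "R": "AG"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
POOLS = ["M", "K", "R"]

def pattern(target_base):
    """y/n score of the three cells for a given target nucleotide."""
    needed = COMPLEMENT[target_base]  # probe base that pairs with the target
    return "".join("y" if needed in IUPAC[p] else "n" for p in POOLS)

# Pattern expected for each possible nucleotide at the interrogated site.
PATTERNS = {base: pattern(base) for base in "ACGT"}

def identify(observed):
    """Return the target nucleotide giving the observed pattern, if unique."""
    matches = [b for b, p in PATTERNS.items() if p == observed]
    return matches[0] if len(matches) == 1 else None

print(PATTERNS)         # {'A': 'nyn', 'C': 'nyy', 'G': 'ynn', 'T': 'yny'}
print(identify("ynn"))  # G
```

The four patterns are distinct, which is why three cells suffice to 
identify the nucleotide at the interrogated site. 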

A more efficient pooling strategy for sequence analysis 
is the 'Trellis' strategy. In this strategy, each pooled 
probe has a segment of perfect complementarity to a reference 
sequence except at three pooled positions. One pooled 
position is an N pool. The three pooled positions may or may 
not be contiguous in a probe. The other two pooled positions 
are selected from the group of three pools consisting of (1) M 
or K, (2) R or Y and (3) W or S, where the single letters are 
IUPAC standard ambiguity codes. The sequence of a pooled 
probe is thus of the form XXXN[(M/K) or (R/Y) or (W/S)][(M/K) 
or (R/Y) or (W/S)]XXXXX, where XXX represents bases 
complementary to the reference sequence. The three pooled 
positions may be in any order, and may be contiguous or 
separated by intervening nucleotides. For the two positions 
occupied by [(M/K) or (R/Y) or (W/S)], two choices must be 
made. First, one must select one of the following three pairs 
of pooled nucleotides: (1) M/K, (2) R/Y and (3) W/S. The pair 
selected may be the same or different at the two pooled 
positions. Second, supposing, for example, one selects M/K at 
one position, one must then choose between M and K. This 
choice should result in selection of a pooled nucleotide 
comprising a nucleotide that complements the corresponding 
nucleotide in a reference sequence, when the probe and 
reference sequence are maximally aligned. The same 
principle governs the selection between R and Y, and between W 
and S. A trellis pool probe has one pooled position with four 
possibilities, and two pooled positions, each with two 
possibilities. Thus, a trellis pool probe comprises a mixture 
of 16 (4x2x2) probes. Since each pooled position includes 
one nucleotide that complements the corresponding nucleotide 
from the reference sequence, one of these 16 probes has a 
segment that is the exact complement of the reference 
sequence. A target sequence that is the same as the reference 
sequence (i.e., a wildtype target) gives a hybridization 
signal to each probe cell. Here, as in other tiling methods, 
the segment of complementarity should be sufficiently long to 
permit specific hybridization of a pooled probe to a reference 
sequence to be detected relative to a variant of that reference 
sequence. Typically, the segment of complementarity is about 
9-21 nucleotides. 

A target sequence is analyzed by comparing hybridization 
intensities at three pooled probes, each having the structure 
described above. The segments complementary to the reference 
sequence present in the three pooled probes show some overlap. 
Sometimes the segments are identical (other than at the 
interrogation positions). However, this need not be the case. 
For example, the segments can tile across a reference sequence 
in increments of one nucleotide (i.e., one pooled probe 
differs from the next by the acquisition of one nucleotide at 
the 5' end and loss of a nucleotide at the 3' end). The three 
interrogation positions may or may not occur at the same 
relative positions within each pooled probe (i.e., spacing 
from a probe terminus). All that is required is that one of 
the three interrogation positions from each of the three 
pooled probes aligns with the same nucleotide in the reference 
sequence, and that this interrogation position is occupied by 
a different pooled nucleotide in each of the three probes. In 
one of the three probes, the interrogation position is 
occupied by an N. In the other two pooled probes the 
interrogation position is occupied by one of (M/K) or (R/Y) or 
(W/S). 

In the simplest form of the trellis strategy, three 
pooled probes are used to analyze a single nucleotide in the 
reference sequence. Much greater economy of probes is 
achieved when more pooled probes are included in an array. 
For example, consider an array of five pooled probes each 
having the general structure outlined above. Three of these 
pooled probes have an interrogation position that aligns with 
the same nucleotide in the reference sequence and are used to 
read that nucleotide. A different combination of three probes 
have an interrogation position that aligns with a different 
nucleotide in the reference sequence. Comparison of these 
three probe intensities allows analysis of this second 
nucleotide. Still another combination of three pooled probes 
from the set of five have an interrogation position that 
aligns with a third nucleotide in the reference sequence and 
these probes are used to analyze that nucleotide. Thus, three 
nucleotides in the reference sequence are fully analyzed from 
only five pooled probes. By comparison, the basic tiling 
strategy would require 12 probes for a similar analysis. 

As an example, a pooled probe for analysis of a target 
sequence by the trellis strategy is shown below: 

Target : ATTAACCACTCACGGGAGCTCT 
Pool : TGGTGNKYGCCCT 

The pooled probe actually comprises 16 individual probes: 

TGGTGAGcGCCCT 
+TGGTGcGcGCCCT 
+TGGTGgGcGCCCT 
+TGGTGtGcGCCCT 
+TGGTGAtcGCCCT 
+TGGTGctcGCCCT 
+TGGTGgtcGCCCT 
+TGGTGttcGCCCT 
+TGGTGAGTGCCCT 
+TGGTGcGTGCCCT 
+TGGTGgGTGCCCT 
+TGGTGtGTGCCCT 
+TGGTGAtTGCCCT 
+TGGTGctTGCCCT 
+TGGTGgtTGCCCT 
+TGGTGttTGCCCT 

The trellis strategy employs an array of probes having at 
least three cells, each of which is occupied by a pooled probe 
as described above. 

Consider the use of three such pooled probes for 
analyzing a target sequence, of which one position may contain 
any single base substitution to the reference sequence (i.e., 
there are four possible target sequences to be distinguished). 
Three cells are occupied by pooled probes having a pooled 
interrogation position corresponding to the position of 
possible substitution in the target sequence, one cell with an 
'N', one cell with one of 'M' or 'K', and one cell with 'R' or 
'Y'. An interrogation position corresponds to a nucleotide in 
the target sequence if it aligns with that nucleotide 
when the probe and target sequence are aligned to maximize 
complementarity. Note that although each of the pooled 
probes has two other pooled positions, these positions are not 
relevant for the present illustration. The positions are only 
relevant when more than one position in the target sequence is 
to be read, a circumstance that will be considered later. For 
present purposes, the cell with the 'N' in the interrogation 
position lights up for the wildtype sequence and any of the 
three single base substitutions of the target sequence. The 
cell with M/K in the interrogation position lights up for the 
wildtype sequence and one of the single-base substitutions. 
The cell with R/Y in the interrogation position lights up for 
the wildtype sequence and a second of the single-base 
substitutions. Thus, the four possible target sequences 
hybridize to the three pools of probes in four distinct 
patterns, and the four possible target sequences can be 
distinguished. 

To illustrate further, consider four possible target 
sequences (differing at a single position) and a pooled probe 
having three pooled positions, N, K and Y with the Y position 
as the interrogation position (i.e., aligned with the variable 
position in the target sequence): 

Target 
Wild:    ATTAACCACTCACGGGAGCTCT (w) 
Mutant:  ATTAACCACTCcCGGGAGCTCT (c) 
Mutant:  ATTAACCACTCgCGGGAGCTCT (g) 
Mutant:  ATTAACCACTCtCGGGAGCTCT (t) 
             TGGTGNKYGCCCT (pooled probe) 

The sixteen individual component probes of the pooled probe 
hybridize to the four possible target sequences as follows: 

                      TARGET 
                   w   c   g   t 
TGGTGAGcGCCCT      n   n   y   n 
TGGTGcGcGCCCT      n   n   n   n 
TGGTGgGcGCCCT      n   n   n   n 
TGGTGtGcGCCCT      n   n   n   n 
TGGTGAtcGCCCT      n   n   n   n 
TGGTGctcGCCCT      n   n   n   n 
TGGTGgtcGCCCT      n   n   n   n 
TGGTGttcGCCCT      n   n   n   n 
TGGTGAGTGCCCT      y   n   n   n 
TGGTGcGTGCCCT      n   n   n   n 
TGGTGgGTGCCCT      n   n   n   n 
TGGTGtGTGCCCT      n   n   n   n 
TGGTGAtTGCCCT      n   n   n   n 
TGGTGctTGCCCT      n   n   n   n 
TGGTGgtTGCCCT      n   n   n   n 
TGGTGttTGCCCT      n   n   n   n 



The pooled probe hybridizes according to the aggregate of its 
components: 

                   w   c   g   t 
Pool: TGGTGNKYGCCCT  y   n   y   n 

Thus, as stated above, it can be seen that a pooled probe 
having a Y at the interrogation position hybridizes to the 
wildtype target and one of the mutants. Similar tables can be 
drawn to illustrate the hybridization patterns of probe pools 
having other pooled nucleotides at the interrogation position. 
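
The aggregate behaviour of a pooled probe can be checked with a short 
Python sketch. The helper names (expand, lights_up) and the alignment 
constants START and VARIABLE are assumptions made for this illustration; 
probes are written 3' to 5' beneath the target, so complementarity is 
tested base by base without reversing the sequence. 

```python
from itertools import product

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "AC", "K": "GT",
         "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "N": "ACGT"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

REFERENCE = "ATTAACCACTCACGGGAGCTCT"
POOL = "TGGTGNKYGCCCT"
START = 4      # assumed 0-based offset of the probe along the reference
VARIABLE = 11  # 0-based index of the variable (12th) nucleotide

def expand(pooled):
    """List every component probe of a pooled probe."""
    return ["".join(p) for p in product(*(IUPAC[ch] for ch in pooled))]

def lights_up(pooled, target):
    """A cell is scored positive if any component exactly complements."""
    segment = target[START:START + len(pooled)]
    exact = "".join(COMPLEMENT[b] for b in segment)
    return exact in expand(pooled)

targets = {b: REFERENCE[:VARIABLE] + b + REFERENCE[VARIABLE + 1:]
           for b in "ACGT"}
pattern = {b: lights_up(POOL, t) for b, t in targets.items()}

print(len(expand(POOL)))  # 16
print(pattern)  # {'A': True, 'C': False, 'G': True, 'T': False}
```

The wildtype target (A) and the G mutant give a signal, in agreement 
with the pattern y n y n for targets w, c, g and t. 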

The above strategy of using pooled probes to analyze a 
single base in a target sequence can readily be extended to 
analyze any number of bases. At this point, the purpose of 
including three pooled positions within each probe will become 
apparent. In the example that follows, ten pools of probes, 
each containing three pooled probe positions, can be used to 
analyze each of a contiguous sequence of eight nucleotides 
in a target sequence. 

    ATTAACCACTCACGGGAGCTCT   Reference sequence 
           XXXXXXXX          Readable nucleotides 

Pools: 
 4  TAATTNKYGAGTG 
 5   AATTGNKRAGTGC 
 6    ATTGGNKRGTGCC 
 7     TTGGTNMRTGCCC 
 8      TGGTGNKYGCCCT 
 9       GGTGANKRCCCTC 
10        GTGAGNKYCCTCG 
11         TGAGTNMYCTCGA 
12          GAGTGNMYTCGAG 
13           AGTGCNMYCGAGA 



In this example, the different pooled probes tile across 
the reference sequence, each pooled probe differing from the 
next by increments of one nucleotide. For each of the 
readable nucleotides in the reference sequence, there are 
three probe pools having a pooled interrogation position 
aligned with the readable nucleotide. For example, the 12th 
nucleotide from the left in the reference sequence is aligned 
with pooled interrogation positions in pooled probes 8, 9, and 
10. Comparison of the hybridization intensities of these 
pooled probes reveals the identity of the nucleotide occupying 
position 12 in a target sequence. 



                                       Pools 
         Targets                     8   9   10 
Wild:    ATTAACCACTCACGGGAGCTCT      Y   Y   Y 
Mutant:  ATTAACCACTCcCGGGAGCTCT      N   Y   Y 
Mutant:  ATTAACCACTCgCGGGAGCTCT      Y   N   Y 
Mutant:  ATTAACCACTCtCGGGAGCTCT      N   N   Y 



Example Intensities: [figure showing lit cells and blank cells, 
with rows labeled Wild, 'C', 'G' and None] 

Thus, for example, if pools 8, 9 and 10 all light up, one 
knows the target sequence is wildtype. If pools 9 and 10 
light up, the target sequence has a C mutant at position 12. 
If pools 8 and 10 light up, the target sequence has a G mutant 
at position 12. If only pool 10 lights up, the target 
sequence has a T mutant at position 12. 

The identity of other nucleotides in the target sequence 
is determined by a comparison of other sets of three pooled 
probes. For example, the identity of the 13th nucleotide in 
the target sequence is determined by comparing the 
hybridization patterns of the probe pools designated 9, 10 and 
11. Similarly, the identity of the 14th nucleotide in the 
target sequence is determined by comparing the hybridization 
patterns of the probe pools designated 10, 11, and 12. 
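
The construction and reading of the trellis pools in this example can be 
sketched in Python. The sketch assumes, as in the ten pools listed, 
13-mer probes tiled in steps of one nucleotide with N, M/K and R/Y at 
the 6th, 7th and 8th probe positions; trellis_pool, pick and 
read_position are hypothetical names, and only a single substitution at 
the position being read is considered. 

```python
IUPAC = {"M": "AC", "K": "GT", "R": "AG", "Y": "CT", "N": "ACGT"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
REFERENCE = "ATTAACCACTCACGGGAGCTCT"

def pick(pair, base):
    """Choose the member of a pool pair (e.g. 'MK') that contains base."""
    return next(code for code in pair if base in IUPAC[code])

def trellis_pool(start, length=13):
    """Pooled probe complementary to REFERENCE[start:start+length]."""
    probe = [COMPLEMENT[b] for b in REFERENCE[start:start + length]]
    probe[6] = pick("MK", probe[6])  # pool containing the wildtype complement
    probe[7] = pick("RY", probe[7])
    probe[5] = "N"
    return "".join(probe)

# Pool number n begins at 0-based reference offset n - 4 (pools 4 to 13).
pools = {n: trellis_pool(n - 4) for n in range(4, 14)}

def read_position(pos, base):
    """Y/N pattern of pools pos-4, pos-3, pos-2 for `base` at 1-based pos."""
    needed = COMPLEMENT[base]
    result = ""
    for n in (pos - 4, pos - 3, pos - 2):
        offset = (pos - 1) - (n - 4)  # index of pos within pool n
        result += "Y" if needed in IUPAC[pools[n][offset]] else "N"
    return result

print(pools[8])  # TGGTGNKYGCCCT
print({b: read_position(12, b) for b in "ACGT"})
# {'A': 'YYY', 'C': 'NYY', 'G': 'YNY', 'T': 'NNY'}
```

The patterns for position 12 reproduce the table for pools 8, 9 and 10, 
and the same function can be applied to positions 8 through 15. 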

In the above example, successive probes tile across the 
reference sequence in increments of one nucleotide, and each 
probe has three interrogation positions occupying the same 
positions in each probe relative to the terminus of the probe 
(i.e., the 7, 8 and 9th positions relative to the 3' 
terminus). However, the trellis strategy does not require 
that probes tile in increments of one or that the 
interrogation positions occur in the same position in 
each probe. In a variant of trellis tiling referred to as 
"loop" tiling, a nucleotide of interest in a target sequence 
is read by comparison of pooled probes, which each have a 
pooled interrogation position corresponding to the nucleotide 
of interest, but in which the spacing of the interrogation 
position in the probe differs from probe to probe. 
Analogously to the block tiling approach, this allows several 
nucleotides to be read from a target sequence from a 
collection of probes that are identical except at the 
interrogation position. The identity in sequence of probes, 
particularly at their 3' termini, simplifies synthesis of the 
array and results in more uniform probe density per cell. 

To illustrate the loop strategy, consider a reference 
sequence of which the 4, 5, 6, 7 and 8th nucleotides (from the 
3' terminus) are to be read. All of the four possible 
nucleotides at each of these positions can be read from 
comparison of hybridization intensities of five pooled probes. 
Note that the pooled positions in the probes are different 
(for example in probe 55, the pooled positions are 4, 5 and 6 
and in probe 56, 5, 6 and 7). 

      TAACCACTCACGGGAGCA   Reference sequence 
55    ATTNKYGAGTGCC 
56    ATTGNKRAGTGCC 
57    ATTGGNKRGTGCC 
58    ATTRGTNMGTGCC 
59    ATTKRTGNGTGCC 

Each position of interest in the reference sequence is read by 
comparing hybridization intensities for the three probe pools 
that have an interrogation position aligned with the 
nucleotide of interest in the reference sequence. For 
example, to read the fourth nucleotide in the reference 
sequence, probes 55, 58 and 59 provide pools at the fourth 
position. Similarly, to read the fifth nucleotide in the 
reference sequence, probes 55, 56 and 59 provide pools at the 
fifth position. As in the previous trellis strategy, one of 
the three probes being compared has an N at the pooled 
position and the other two have pools selected from two of 
(1) M or K, (2) R or Y and (3) W or S. 

The hybridization pattern of the five pooled probes to 
target sequences representing each possible nucleotide 
substitution at five positions in the reference sequence is 
shown below. Each possible substitution results in a unique 
hybridization pattern at three pooled probes, and the identity 
of the nucleotide at that position can be deduced from the 
hybridization pattern. 

                                        Pools 
         Targets                 55  56  57  58  59 
Wild:    TAACCACTCACGGGAGCA      Y   Y   Y   Y   Y 
Mutant:  TAAgCACTCACGGGAGCA      Y   N   N   N   N 
Mutant:  TAAtCACTCACGGGAGCA      Y   N   N   Y   N 
Mutant:  TAAaCACTCACGGGAGCA      Y   N   N   N   Y 
Mutant:  TAACgACTCACGGGAGCA      N   Y   N   N   N 
Mutant:  TAACtACTCACGGGAGCA      N   Y   N   N   Y 
Mutant:  TAACaACTCACGGGAGCA      Y   Y   N   N   N 
Mutant:  TAACCcCTCACGGGAGCA      N   Y   Y   N   N 
Mutant:  TAACCgCTCACGGGAGCA      Y   N   Y   N   N 
Mutant:  TAACCtCTCACGGGAGCA      N   N   Y   N   N 
Mutant:  TAACCAgTCACGGGAGCA      N   N   N   Y   N 
Mutant:  TAACCAtTCACGGGAGCA      N   Y   N   Y   N 
Mutant:  TAACCAaTCACGGGAGCA      N   N   Y   Y   N 
Mutant:  TAACCACaCACGGGAGCA      N   N   N   N   Y 
Mutant:  TAACCACcCACGGGAGCA      N   N   Y   N   Y 
Mutant:  TAACCACgCACGGGAGCA      N   N   N   Y   Y 


Many variations on the loop and trellis tilings can be 
created. All that is required is that each position in the 
sequence must have a probe with an 'N', a probe containing one 
of R/Y, M/K or W/S, and a probe containing a different pool 
from that set, complementary to the wild type target at that 
position, and at least one probe with no pool at all at that 
position. This combination allows all mutations at that 
position to be uniquely detected and identified. 
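
Whether a given set of loop probes distinguishes every substitution can 
be checked mechanically. The following Python sketch uses the five loop 
probes 55 to 59 given above; the function names are hypothetical, and a 
pool is scored as lit only when every probe position can pair with the 
aligned target base. 

```python
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "AC", "K": "GT",
         "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "N": "ACGT"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

REFERENCE = "TAACCACTCACGGGAGCA"
PROBES = {55: "ATTNKYGAGTGCC", 56: "ATTGNKRAGTGCC", 57: "ATTGGNKRGTGCC",
          58: "ATTRGTNMGTGCC", 59: "ATTKRTGNGTGCC"}

def lights_up(probe, target):
    """True if some component of the pooled probe exactly complements."""
    return all(COMPLEMENT[t] in IUPAC[p] for p, t in zip(probe, target))

def pattern(target):
    """Y/N scores of pools 55 to 59, in that order."""
    return "".join("Y" if lights_up(PROBES[n], target) else "N"
                   for n in sorted(PROBES))

def patterns_at(pos):
    """Pattern for each possible nucleotide at 1-based position pos."""
    return {b: pattern(REFERENCE[:pos - 1] + b + REFERENCE[pos:])
            for b in "ACGT"}

for pos in range(4, 9):
    table = patterns_at(pos)
    # Four distinct patterns mean the position can be read unambiguously.
    print(pos, table, len(set(table.values())) == 4)
```

For each of positions 4 through 8 the four patterns are distinct, so the 
nucleotide at each position can be identified from the five cells. 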

A further class of strategies involving pooled probes is 
termed coding strategies. These strategies assign code words 
from some set of numbers to variants of a reference sequence. 
Any number of variants can be coded. The variants can include 
multiple closely spaced substitutions, deletions or 
insertions. The designation letters or other symbols assigned 
to each variant may be any arbitrary set of numbers, in any 
order. For example, a binary code is often used, but codes to 
other bases are entirely feasible. The numbers are often 
assigned such that each variant has a designation having at 
least one digit and at least one nonzero value for that digit. 
For example, in a binary system, a variant assigned the number 
101 has a designation of three digits, with one possible 
nonzero value for each digit. 

The designations of the variants are coded into an array 
of pooled probes comprising a pooled probe for each nonzero 
value of each digit in the numbers assigned to the variants. 
For example, if the variants are assigned successive numbers in 
a numbering system of base m, and the highest number assigned 
to a variant has n digits, the array would have about n x (m-1) 
pooled probes. In general, log_m(3N+1) probes are required 
to analyze all variants of N locations in a reference 
sequence, each having three possible mutant substitutions. 
For example, 10 base pairs of sequence may be analyzed with 
only 5 pooled probes using a binary coding system. 

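
The count of digits implied by the expression log_m(3N+1) can be 
computed as follows. This Python sketch rounds the logarithm up to a 
whole number of digits using integer arithmetic; the function names and 
the explicit rounding are assumptions of the illustration. 

```python
def digits_needed(n_locations, base=2, mutants_per_location=3):
    """Smallest d with base**d >= 3N + 1 (all variants plus wildtype)."""
    codes_needed = mutants_per_location * n_locations + 1
    digits, capacity = 0, 1
    while capacity < codes_needed:
        capacity *= base
        digits += 1
    return digits

def pooled_probes(n_locations, base=2):
    """About n x (m-1) pooled probes: one per nonzero value of each digit."""
    return digits_needed(n_locations, base) * (base - 1)

print(pooled_probes(10))  # 5 pooled probes for 10 locations, binary code
print(pooled_probes(4))   # 4 pooled probes for 4 locations, binary code
```

With a binary code the number of pooled probes equals the number of 
digits, which gives 5 probes for 10 locations. 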
Each pooled probe has a segment exactly complementary to the 
reference sequence except that certain positions are pooled. 
The segment should be sufficiently long to allow specific 
hybridization of the pooled probe to the reference sequence 
relative to a mutated form of the reference sequence. As in 
other tiling strategies, segment lengths of 9-21 nucleotides 
are typical. Often the probe has no nucleotides other than 
the 9-21 nucleotide segment. The pooled positions comprise 
nucleotides that allow the pooled probe to hybridize to every 
variant assigned a particular nonzero value in a particular 
digit. Usually, the pooled positions further comprise a 
nucleotide that allows the pooled probe to hybridize to the 
reference sequence. Thus, a wildtype target (or reference 
sequence) is immediately recognizable from all the pooled 
probes being lit. 

When a target is hybridized to the pools, only those 
pools comprising a component probe having a segment that is 
exactly complementary to the target light up. The identity of 
the target is then decoded from the pattern of hybridizing 
pools. Each pool that lights up is correlated with a 
particular value in a particular digit. Thus, the aggregate 
hybridization patterns of each lighting pool reveal the value 
of each digit in the code defining the identity of the target 
hybridized to the array. 

As an example, consider a reference sequence having four 
positions, each of which can be occupied by three possible 
mutations. Thus, in total there are 4 x 3 = 12 possible variant 
forms of the reference sequence. Each variant is assigned one 
of the binary numbers 0001-1100 and the wildtype 
reference sequence is assigned the binary number 1111. 



Positions:        X        X        X        X 
Target:  TAAC   C=1111   A=1111   C=1111   T=1111   CACGGGAGCA 
                G=0001   C=0010   G=0011   A=0100 
                T=0101   G=0110   T=0111   C=1000 
                A=1001   T=1010   A=1011   G=1100 



A first pooled probe is designed by including probes that 
complement exactly each variant having a 1 in the first digit. 



target (1111):       TAAC   C   A   C   T   CACGGGAGCA 
Mutant (0001):       TAAC   g   A   C   T   CACGGGAGCA 
Mutant (0101):       TAAC   t   A   C   T   CACGGGAGCA 
Mutant (1001):       TAAC   a   A   C   T   CACGGGAGCA 
Mutant (0011):       TAAC   C   A   g   T   CACGGGAGCA 
Mutant (0111):       TAAC   C   A   t   T   CACGGGAGCA 
Mutant (1011):       TAAC   C   A   a   T   CACGGGAGCA 

First pooled probe = ATTG [GCAT] T [GCAT] A GTGCCC 
                   = ATTG    N   T    N   A GTGCCC 



Second, third and fourth pooled probes are then designed 
respectively including component probes that hybridize to each 
variant having a 1 in the second, third and fourth digit. 

                 XXXX          - 4 positions examined 
Target:      TAACCACTCACGGGAGCA 
Pool 1(1):   ATTGnTnAGTGCCC = 16 probes (4x1x4x1) 
Pool 2(2):   ATTGGnnAGTGCCC = 16 probes (1x4x4x1) 
Pool 3(4):   ATTGryrhGTGCCC = 24 probes (2x2x2x3) 
Pool 4(8):   ATTGkwkvGTGCCC = 24 probes (2x2x2x3) 

The pooled probes hybridize to variant targets as follows: 

Hybridization pattern: 
                                             Pools 
                  Targets                 1   2   3   4 
Wild (1111):      TAACCACTCACGGGAGCA      Y   Y   Y   Y 
Mutant (0001):    TAACgACTCACGGGAGCA      Y   N   N   N 
Mutant (0101):    TAACtACTCACGGGAGCA      Y   N   Y   N 
Mutant (1001):    TAACaACTCACGGGAGCA      Y   N   N   Y 
Mutant (0010):    TAACCcCTCACGGGAGCA      N   Y   N   N 
Mutant (0110):    TAACCgCTCACGGGAGCA      N   Y   Y   N 
Mutant (1010):    TAACCtCTCACGGGAGCA      N   Y   N   Y 
Mutant (0011):    TAACCAgTCACGGGAGCA      Y   Y   N   N 
Mutant (0111):    TAACCAtTCACGGGAGCA      Y   Y   Y   N 
Mutant (1011):    TAACCAaTCACGGGAGCA      Y   Y   N   Y 
Mutant (0100):    TAACCACaCACGGGAGCA      N   N   Y   N 
Mutant (1000):    TAACCACcCACGGGAGCA      N   N   N   Y 
Mutant (1100):    TAACCACgCACGGGAGCA      N   N   Y   Y 



The identity of a variant (i.e., mutant) target is read 
directly from the hybridization pattern of the pooled probes. 
For example, the mutant assigned the number 0001 gives a 
hybridization pattern of NNNY with respect to probes 4, 3, 2 
and 1 respectively. 
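
The design of the four binary-coded pools and the decoding of a 
hybridization pattern can be sketched in Python. The code words are 
those assigned to the four positions in the listing above; the function 
names are hypothetical, and only the four examined positions (CACT in 
the target) are modelled. 

```python
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "AC", "R": "AG",
         "W": "AT", "S": "CG", "Y": "CT", "K": "GT", "V": "ACG",
         "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT"}
CODE_FOR = {frozenset(v): k for k, v in IUPAC.items()}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

WILD = "CACT"  # target bases at the four examined positions
CODES = {      # (position index, mutant base) -> assigned code word
    (0, "G"): 0b0001, (0, "T"): 0b0101, (0, "A"): 0b1001,
    (1, "C"): 0b0010, (1, "G"): 0b0110, (1, "T"): 0b1010,
    (2, "G"): 0b0011, (2, "T"): 0b0111, (2, "A"): 0b1011,
    (3, "A"): 0b0100, (3, "C"): 0b1000, (3, "G"): 0b1100,
}

def pool_for_digit(digit):
    """Pooled probe bases pairing with wildtype and with every variant
    whose code word has a 1 in the given binary digit."""
    letters = []
    for pos, wild in enumerate(WILD):
        targets = {wild} | {b for (p, b), code in CODES.items()
                            if p == pos and (code >> digit) & 1}
        letters.append(CODE_FOR[frozenset(COMPLEMENT[t] for t in targets)])
    return "".join(letters)

POOLS = [pool_for_digit(d) for d in range(4)]

def decode(pos, base):
    """Code word read from the lit pools for `base` at one position."""
    needed = COMPLEMENT[base]
    return sum(1 << d for d, pool in enumerate(POOLS)
               if needed in IUPAC[pool[pos]])

print(POOLS)                           # ['NTNA', 'GNNA', 'RYRH', 'KWKV']
print(format(decode(0, "G"), "04b"))   # 0001
print(format(decode(0, "C"), "04b"))   # 1111 (wildtype)
```

Each single substitution decodes to its assigned code word, and the 
wildtype target lights every pool. 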

In the above example, variants are assigned successive 
numbers in a numbering system. In other embodiments, sets of 
numbers can be chosen for their properties. If the codewords 
are chosen from an error-control code, the properties of that 
code carry over to sequence analysis. An error code is a 
numbering system in which some designations are assigned to 
variants and other designations serve to indicate errors that 
may have occurred in the hybridization process. For example, 
if all codewords have an odd number of nonzero digits ('binary 
coding+error detection'), any single error in hybridization 
will be detected by having an even number of pools lit. 



Wild 
Target:      TAACCACTCACGGGAGCA 
Pool 1(1):   ATTGnTnAGTGCCC = 16 probes (4x1x4x1) 
Pool 2(2):   ATTGGnnAGTGCCC = 16 probes (1x4x4x1) 
Pool 3(4):   ATTGryrhGTGCCC = 24 probes (2x2x2x3) 
Pool 4(8):   ATTGkwkvGTGCCC = 24 probes (2x2x2x3) 

A fifth probe can be added to make the number of pools that 
hybridize to any single mutation odd. 

Pool 5(c):   ATTGdhsmGTGCCC = 36 probes (3x3x2x2) 

Hybridization of pooled probes to targets 

                                                 Pool 
                   Target                  1   2   3   4   5 
Target (11111):    TAACCACTCACGGGAGCA      Y   Y   Y   Y   Y 
Mutant (00001):    TAACgACTCACGGGAGCA      Y   N   N   N   N 
Mutant (10101):    TAACtACTCACGGGAGCA      Y   N   Y   N   Y 
Mutant (11001):    TAACaACTCACGGGAGCA      Y   N   N   Y   Y 
Mutant (00010):    TAACCcCTCACGGGAGCA      N   Y   N   N   N 
Mutant (10110):    TAACCgCTCACGGGAGCA      N   Y   Y   N   Y 
Mutant (11010):    TAACCtCTCACGGGAGCA      N   Y   N   Y   Y 
Mutant (10011):    TAACCAgTCACGGGAGCA      Y   Y   N   N   Y 
Mutant (00111):    TAACCAtTCACGGGAGCA      Y   Y   Y   N   N 
Mutant (01011):    TAACCAaTCACGGGAGCA      Y   Y   N   Y   N 
Mutant (00100):    TAACCACaCACGGGAGCA      N   N   Y   N   N 
Mutant (01000):    TAACCACcCACGGGAGCA      N   N   N   Y   N 
Mutant (11100):    TAACCACgCACGGGAGCA      N   N   Y   Y   Y 
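
The error-detecting property of the fifth pool can be illustrated with a 
short Python sketch. The functions with_parity and odd_parity_ok are 
hypothetical names; the fifth digit is chosen so that every code word 
has an odd number of nonzero digits. 

```python
def with_parity(code, width=4):
    """Prepend a fifth digit so that the number of 1s in the word is odd."""
    ones = bin(code).count("1")
    parity = 1 if ones % 2 == 0 else 0
    return (parity << width) | code

def odd_parity_ok(lit_pools):
    """A valid pattern has an odd number of lit pools; a single spurious
    or missing signal makes the count even and is therefore detected."""
    return sum(lit_pools) % 2 == 1

print(format(with_parity(0b0101), "05b"))  # 10101
print(format(with_parity(0b0001), "05b"))  # 00001
print(odd_parity_ok([1, 0, 1, 0, 1]))      # True  (pools 1, 3 and 5 lit)
print(odd_parity_ok([1, 0, 0, 0, 1]))      # False (pool 3 signal missing)
```

A pattern with an even number of lit pools does not correspond to any 
assigned code word and indicates an error in hybridization. 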



9. Bridging strategy 

Probes that contain partial matches to two separate 
(i.e., non contiguous) subsequences of a target sequence 
sometimes hybridize strongly to the target sequence. In 
certain instances, such probes have generated stronger signals 
than probes of the same length which are perfect matches to 
the target sequence. It is believed (but not necessary to the 
invention) that this observation results from interactions of 
a single target sequence with two or more probes 
simultaneously. This invention exploits this observation to 
provide arrays of probes having at least first and second 
segments, which are respectively complementary to first and 
second subsequences of a reference sequence. Optionally, the 
probes may have a third or more complementary segments. These 
probes can be employed in any of the strategies noted above. 
The two segments of such a probe can be complementary to 
disjoint subsequences of the reference sequence or contiguous 
subsequences. If the latter, the two segments in the probe 
are inverted relative to the order of the complement of the 
reference sequence. The two subsequences of the reference 
sequence each typically comprise about 3 to 30 contiguous 
nucleotides. The subsequences of the reference sequence are 
sometimes separated by 0, 1, 2 or 3 bases. Often the 
subsequences are adjacent and nonoverlapping. 

For example, a wild-type probe is created by 
complementing two sections of a reference sequence (indicated 
by subscript and superscript) and reversing their order. The 
interrogation position is designated (*) and is apparent from 
comparison of the structure of the wildtype probe with the 
three mutant probes. The corresponding nucleotide in the 
reference sequence is the "a" in the superscripted segment. 

Reference: 5' T_{ggcta}^{cgagg}AATCATCTGTTA 

Probes: 3' GCTCC CCGAT (Probe from first probe set) 

3' GCACC CCGAT 

3' GCCCC CCGAT 

3' GCGCC CCGAT 

The expected hybridizations are: 
Match: 

GCTCCCCGAT 

. . . TGGCTACGAGGAATCATCTGTTA 
GCTCC CCGAT 

Mismatch: 

GCTCC CCGAT 

TGGCTACGAGGAATCATCTGTTA 

GCGCCCCGAT 

Bridge tilings are specified using a notation which gives 
the length of the two constituent segments and the relative 
position of the interrogation position. The designation n/m 
indicates a segment complementary to a region of the reference 
sequence which extends for n bases and is located such that 
the interrogation position is in the mth base from the 5' end. 
If m is larger than n, this indicates that the entire segment 
is to the 5' side of the interrogation position. If m is 
negative, it indicates that the interrogation position is the 
absolute value of m bases 5' of the first base of the segment 
(m cannot be zero). Probes comprising multiple segments, such 
as n/m + a/b + ..., have a first segment at the 3' end of the 
probe and additional segments added 5' with respect to the 
first segment. For example, a 4/8 + 6/3 tiling consists of 
(from the 3' end of the probe) a 4 base complementary segment, 
starting 7 bases 5' of the interrogation position, followed by 
a 6 base region in which the interrogation position is located 
at the third base. Between these two segments, one base from 
the reference sequence is omitted. By this notation, the set 
shown above is a 5/3 + 5/8 tiling. Many different tilings are 
possible with this method, since the lengths of both segments 
can be varied, as well as their relative position (they may be 
in either order and there may be a gap between them) and their 
location relative to the interrogation position. 
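
The n/m notation can be made concrete with a short sketch. The Python below is illustrative only: the function names segment_span and bridge_probe, the 0-based index for the interrogation position, and the representation of a tiling as a list of (n, m) pairs are assumptions introduced here and are not part of the specification. The probe is written 3' to 5', with the first-listed segment at its 3' end.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def segment_span(n, m, interrogation):
    """Return the half-open (start, end) reference span of an n/m segment.

    `interrogation` is the 0-based index of the interrogation position
    in the reference, which is written 5' to 3' (an assumed convention).
    """
    if n <= 0 or m == 0:
        raise ValueError("n must be positive and m cannot be zero")
    if m > 0:
        # Interrogation position is the m-th base from the segment's 5' end.
        start = interrogation - (m - 1)
    else:
        # Interrogation position lies |m| bases 5' of the segment's first base.
        start = interrogation + abs(m)
    return start, start + n


def bridge_probe(reference, interrogation, tiling, substitute=None):
    """Build a probe, written 3' to 5', for a tiling such as [(5, 3), (5, 8)].

    The first (n, m) pair becomes the segment at the 3' end of the probe.
    If `substitute` is given, it replaces the probe base at the
    interrogation position (used to make the three mismatch probes).
    """
    segments = []
    for n, m in tiling:
        start, end = segment_span(n, m, interrogation)
        if start < 0 or end > len(reference):
            raise ValueError("segment runs off the reference")
        bases = [COMPLEMENT[b] for b in reference[start:end].upper()]
        if substitute is not None and start <= interrogation < end:
            bases[interrogation - start] = substitute
        segments.append("".join(bases))
    return " ".join(segments)


# The 5/3 + 5/8 example from the text; index 8 is the "a" of the
# second reference segment.
reference = "TGGCTACGAGGAATCATCTGTTA"
print(bridge_probe(reference, 8, [(5, 3), (5, 8)]))  # GCTCC CCGAT
```

Passing substitute="A", "C" or "G" yields the three mismatch probes of the set. For a 4/8 + 6/3 tiling the two spans leave exactly one reference base uncovered between them.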

As an example, a 16 mer oligo target was hybridized to a 
chip containing all 4^10 probes of length 10. The chip 
includes short tilings of both standard and bridging types. 
The data from a standard 10/5 tiling was compared to data from 
a 5/3 + 5/8 bridge tiling (see Table 1). Probe intensities 
(mean count/pixel) are displayed along with discrimination 
ratios (correct probe intensity / highest incorrect probe 
intensity). Missing intensity values are less than 50 counts. 
Note that for each base displayed the bridge tiling has a 
higher discrimination value. 
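
The discrimination ratio defined above (correct probe intensity divided by the highest incorrect probe intensity) can be sketched as follows. This is a minimal illustration, not the procedure used to produce the table: the function name, the dictionary layout, and the choice to substitute the 50-count floor for a missing intensity are assumptions, and the intensities in the example are invented.

```python
def discrimination_ratio(intensities, correct_base, floor=50.0):
    """Correct-probe intensity divided by the highest incorrect intensity.

    `intensities` maps each probe base to a mean count/pixel, or to None
    when the value is missing. Missing values are reported as below 50
    counts, so `floor` is used in their place (an assumption made here;
    the resulting ratio is then a lower bound when an incorrect probe
    is missing).
    """
    def value(base):
        v = intensities[base]
        return floor if v is None else float(v)

    incorrect = [value(b) for b in intensities if b != correct_base]
    if not incorrect:
        raise ValueError("need at least one incorrect probe")
    return value(correct_base) / max(incorrect)


# Invented intensities for one interrogation position.
example = {"A": 600.0, "C": 150.0, "G": None, "T": 200.0}
print(discrimination_ratio(example, "A"))  # 3.0
```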

TABLE 1: Comparison of Standard and Bridge Tilings 
TILING PROBE BASE: CORRECT PROBE BASE 
(values listed in extraction order; row and column alignment not recoverable) 

STANDARD (10/5) 
536  148  534  69  167  72  52  126 
DISCRIMINATION: 3.0  1.8  1.8 

BRIDGING 5/3 + 5/8 
A  404  156  276  345  379  80 
DISCRIMINATION: 5.1  1.26 

The bridging strategy offers the following advantages: 
(1) Higher discrimination between matched and mismatched 
probes. 

(2) The possibility of using longer probes in a bridging 
tiling, thereby increasing the specificity of the 
hybridization, without sacrificing discrimination. 

(3) The use of probes in which an interrogation position 
is located very off-center relative to the regions of target 
complementarity. This may be of particular advantage when, 
for example, a probe centered about one region of the 
target gives low hybridization signal. The low signal is 
overcome by using a probe centered about an adjoining region 
giving a higher hybridization signal. 

(4) Disruption of secondary structure that might result 
in annealing of certain probes (see previous discussion of 
helper mutations). 

10. Deletion Tiling 

Deletion tiling is related to both the bridging and 
helper mutant strategies described above. In the deletion 
strategy, comparisons are performed between probes sharing a 
common deletion but differing from each other at an 
interrogation position located outside the deletion. For 
example, a first probe comprises first and second segments, 
each exactly complementary to respective first and second 
subsequences of a reference sequence, wherein the first and 
second subsequences of the reference sequence are separated by 
a short distance (e.g., 1 or 2 nucleotides). The order of the 
first and second segments in the probe is usually the same as 
that of the complement to the first and second subsequences in 
the reference sequence. The interrogation position is usually 
separated from The comparison is performed with three other 
probes, which are identical to the first probe except at an 
interrogation position, which is different in each probe. 

Reference: ...AGTACCAGATCTCTAA... 

Probe set: CATGGNC AGAGA (N = interrogation position). 

Such tilings sometimes offer superior discrimination in 
hybridization intensities between the probe having an 
interrogation position complementary to the target and other 
probes. Thermodynamically, the difference between the 
hybridizations to matched and mismatched targets for the probe 
set shown above is the difference between a single-base bulge 
and a large asymmetric loop (e.g., two bases of target, one of 
probe). This often results in a larger difference in 
stability than the comparison of a perfectly matched probe 
with a probe showing a single base mismatch in the basic 
tiling strategy. 
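
As a sketch of how the probe set shown above follows from the reference, the Python below complements two reference subsequences that flank a one-base deletion and varies the base at the interrogation position. The function name, the half-open index pairs, and the 0-based interrogation index are assumptions chosen for illustration; segments are kept in the same order as the complement of the reference, as described above.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def deletion_probe_set(reference, first, second, interrogation):
    """Return {probe base: probe} for a deletion tiling.

    `first` and `second` are half-open (start, end) spans of the two
    reference subsequences; the bases between them are the deletion.
    `interrogation` is the 0-based reference index whose complement is
    varied across the four probes.
    """
    probes = {}
    for base in "ACGT":
        segments = []
        for start, end in (first, second):
            bases = [COMPLEMENT[b] for b in reference[start:end]]
            if start <= interrogation < end:
                bases[interrogation - start] = base
            segments.append("".join(bases))
        probes[base] = " ".join(segments)
    return probes


reference = "AGTACCAGATCTCTAA"
# Segments GTACCAG and TCTCT; the A between them (index 8) is deleted,
# and the A at index 6 is the interrogation position.
probe_set = deletion_probe_set(reference, (1, 8), (9, 14), 6)
for base, probe in probe_set.items():
    print(base, probe)
```

The probe carrying T at the interrogation position (CATGGTC AGAGA) is the one complementary to this reference.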

The superior discrimination offered by deletion tiling is 
illustrated by Table 2, which compares hybridization data from 
a standard 10/5 tiling with a (4/8 + 6/3) deletion tiling of 
the reference sequence. (The numerators indicate the length 
of the segments and the denominators, the spacing of the 
deletion from the far termini of the segments.) Probe 
intensities (mean count/pixel) are displayed along with 
discrimination ratios (correct probe intensity / highest 
incorrect probe intensity). Note that for each base displayed 
the deletion tiling has a higher discrimination value than 
either standard tiling shown. 

TABLE 2. Comparison of Standard and Deletion Tilings 
TILING PROBE BASE 

STANDARD (10/5) 
DISCRIMINATION: 

DELETION 4/8 + 6/3 
DISCRIMINATION: 

STANDARD (10/7) 
DISCRIMINATION: 

The use of deletion or bridging probes is quite general. 
These probes can be used in any of the tiling strategies of 
the invention. As well as offering superior discrimination, 
the use of deletion or bridging strategies is advantageous for 
certain probes to avoid self-hybridization (either within a 
probe or between two probes of the same sequence). 

c. Preparation of Target Samples 

The target polynucleotide, whose sequence is to be 
determined, is usually isolated from a tissue sample. If the 
target is genomic, the sample may be from any tissue (except 
exclusively red blood cells). For example, whole blood, 
peripheral blood lymphocytes or PBMC, skin, hair or semen are 
convenient sources of clinical samples. These sources are 
also suitable if the target is RNA. Blood and other body 
fluids are also a convenient source for isolating viral 
nucleic acids. If the target is mRNA, the sample is obtained 
from a tissue in which the mRNA is expressed. If the 
polynucleotide in the sample is RNA, it is usually reverse 
transcribed to DNA. DNA samples or cDNA resulting from 
reverse transcription are usually amplified, e.g., by PCR. 
Depending on the selection of primers and amplifying 
enzyme(s), the amplification product can be RNA or DNA. 
Paired primers are selected to flank the borders of a target 
polynucleotide of interest. More than one target can be 
simultaneously amplified by multiplex PCR in which multiple 
paired primers are employed. The target can be labelled at 
one or more nucleotides during or after amplification. For 
some target polynucleotides (depending on size of sample), 
e.g., episomal DNA, sufficient DNA is present in the tissue 
sample to dispense with the amplification step. 

When the target strand is prepared in single-stranded 
form as in preparation of target RNA, the sense of the strand 
should of course be complementary to that of the probes on the 
chip. This is achieved by appropriate selection of primers. 
The target is preferably fragmented before application to the 
chip to reduce or eliminate the formation of secondary 
structures in the target. The average size of target 
segments following hybridization is usually larger than the 
size of probe on the chip. 

III. ILLUSTRATIVE CHIPS 
A. HIV Chip 

HIV has infected a large and expanding number of people, 
resulting in massive health care expenditures. HIV can 
rapidly become resistant to drugs used to treat the infection, 
primarily due to the action of the heterodimeric protein (51 
kDa and 66 kDa) HIV reverse transcriptase (RT), both subunits 
of which are encoded by the 1.7 kb pol gene. The high error 
rate (5-10 per round) of the RT protein is believed to account 
for the hypermutability of HIV. The nucleoside analogues, 
i.e., AZT, ddI, ddC, and d4T, commonly used to treat HIV 
infection are converted to nucleotide analogues by sequential 
phosphorylation in the cytoplasm of infected cells, where 
incorporation of the analogue into the viral DNA results in 
termination of viral replication, because the 5' -> 3' 
phosphodiester linkage cannot be completed. However, after 
about 6 months to 1 year of treatment or less, HIV typically 
mutates the RT gene so as to become incapable of incorporating 
the analogue and so resistant to treatment. Several mutations 
known to be associated with drug resistance are shown in the 
table below. After a virus having drug resistance via a 
mutation becomes predominant, the patient suffers dramatically 
increased viral load, worsening symptoms (typically more 
frequent and difficult-to-treat infections), and ultimately 
death. Switching to a different treatment regimen as soon as 
a resistant mutant virus takes hold may be an important step 
in patient management which prolongs patient life and reduces 
morbidity during life. 

TABLE 3 

SOME RT MUTATIONS ASSOCIATED WITH DRUG RESISTANCE 

ANTIVIRAL      CODON         aa CHANGE            nt CHANGE 
AZT            67            Asp -> Asn           GAC -> AAC 
AZT            70            Lys -> Arg           AAA -> AGA 
AZT            215           Thr -> Phe or Tyr    ACC -> TTC or TAC 
AZT            219           Lys -> Gln or Glu    AAA -> CAA or GAA 
               41            Met -> Leu           ATG -> TTG or CTG 
[illegible]    [illegible]   Met -> Val           ATG -> GTG 
[illegible]    [illegible]   [illegible] 
TIBO 82150     [illegible]   [illegible] 
ddC            [illegible]   [illegible] 
ddC            [illegible]   [illegible] 
[illegible]    [illegible]   Met -> Val 
[illegible]    [illegible]   Met -> Ile           ATG -> ATA 
[illegible]    [illegible]   Ala -> Val           GCC -> GTC 
[illegible]                  Val -> Ile           GTA -> ATA 
AZT + ddI      [illegible]   Phe -> Leu           TTC -> TTA 
AZT + ddI      [illegible]                        [illegible] 
AZT + ddI      [illegible]   Gln -> Met           CAG -> ATG 
Nevirapine     103           Lys -> Asn           AAA -> AAT 
               106           Val -> Ala           GTA -> GCA 
               108 
               181           Tyr -> Cys           TAT -> TGT 
               188           Tyr -> His           TAT -> CAT 
               190           Gly -> Ala           GGA -> GCA 

N.B. Other mutations confer resistance to other drugs. 



A second important therapeutic target for anti-HIV drugs 
is the aspartyl protease enzyme encoded by the HIV genome, 
whose function is required for the formation of infectious 
progeny. See Robbins & Plattner, J. Acquired Immune 
Deficiency Syndromes 6, 162-170 (1993); Kozal et al., Curr. 
Op. Infect. Dis. 7:72-81 (1994). The protease functions in 
processing of viral precursor polypeptides to their active 
forms. Drugs targeted against this enzyme do not impair 
endogenous human proteases, thereby achieving a high degree of 
selective toxicity. Moreover, the protease is expressed later 
in the life-cycle than reverse transcriptase, thereby offering 
the possibility of a combined attack on HIV at two different 
times in its life-cycle. As for drugs targeted against the 
reverse transcriptase, administration of drugs to the protease 
can result in acquisition of drug resistance through mutation 
of the protease. By monitoring the protease gene from 
patients, it is possible to detect the occurrence of 
mutations, and thereby make appropriate adjustments in the 
drug(s) being administered. 

In addition to being infected with HIV, AIDS patients are 
often also infected with a wide variety of other infectious 
agents giving rise to a complex series of symptoms. Often 
diagnosis and treatment is difficult because many different 
pathogens (some life-threatening, others routine) cause 
similar symptoms. Some of these infections, so-called 
opportunistic infections, are caused by bacterial, fungal, 
protozoan or viral pathogens which are normally present in 
small quantity in the body, but are held in check by the 
immune system. When the immune system in AIDS patients fails, 
these normally latent pathogens can grow and generate rampant 
infection. In treating such patients, it would be desirable 
simultaneously to diagnose the presence or absence of a 
variety of the most lethal common infections, determine the 
most effective therapeutic regime against the HIV virus, and 
monitor the overall status of the patient's infection. 

The present invention provides DNA chips for detecting 
the multiple mutations in HIV genes associated with resistance 
to different therapeutics. These DNA chips allow physicians 
to monitor mutations over time and to change therapeutics if 
resistance develops. Some chips also provide probes for 
diagnosis of pathogenic microorganisms that typically occur in 
AIDS patients. 

The sequence selected as a reference sequence can be from 
anywhere in the HIV genome, but should preferably cover a 
region of the HIV genome in which mutations associated with 
drug resistance are known to occur. A reference sequence is 
usually between about 5, 10, 20, 50, 100, 500, 1000, 5,000 or 
10,000 bases in length, and preferably is about 100-1700 bases 
in length. Some reference sequences encompass at least part 
of the reverse transcriptase sequence encoded by the pol gene. 
Preferably, the reference sequence encompasses all, or 
substantially all (i.e., about 75 or 90%) of the reverse 
transcriptase gene. Reverse transcriptase is the target of 
several drugs and, as noted above, the coding sequence is the 
site of many mutations associated with drug resistance. In 
some chips, the reference sequence contains the entire region 
coding reverse transcriptase (850 bp), and in other chips, 
subfragments thereof. In some chips, the reference sequence 
includes other subfragments of the pol gene encoding HIV 
protease or endonuclease, instead of, or as well as the 
segment encoding reverse transcriptase. In some chips, the 
reference sequence also includes other HIV genes such as env 
or gag as well as or instead of the reverse transcriptase 
gene. Certain regions of the gag and env genes are relatively 
well conserved, and their detection provides a means for 
identifying and quantifying the amount of HIV virus infecting 
a patient. In some chips, the reference sequence comprises an 
entire HIV genome. 

It is not critical from which strain of HIV the reference 
sequence is obtained. HIV strains are classified as HIV-I, 
HIV-II or HIV-III, and within these generic groupings there 
are several strains and polymorphic variants of each of these. 
BRU, SF2, HXB2, HXB2R are examples of HIV-1 strains, the 
sequences of which are available from GenBank. The reverse 
transcriptase genes of the BRU and SF2 strains differ at 23 
nucleotides. The HXB2 and HXB2R strains have the same reverse 
transcriptase gene sequence, which differs from that of the 
BRU strain at four nucleotides, and that of SF2 by 27 
nucleotides. In some chips, the reference sequence 
corresponds exactly to the reverse transcriptase sequence in 
the wildtype version of a strain. In other chips, the 
reference sequence corresponds to a consensus sequence of 
several HIV strains. In some chips, the reference sequence 
corresponds to a mutant form of a HIV strain. 

WO 95/11995 PCT/US94/12305 

Chips are designed in accordance with the tiling 
strategies noted above. The probes are designed to be 
complementary to either the coding or noncoding strand of the 
HIV reference sequence. If only one strand is to be read, it 
is preferable to read the coding strand. The greater 
percentage of A residues in this strand relative to the 
noncoding strand generally results in fewer regions of 
ambiguous sequence. 

Some chips contain additional probes or groups of probes 
designed to be complementary to a second reference sequence. 
The second reference sequence is often a subsequence of the 
first reference sequence bearing one or more commonly 
occurring HIV mutations or interstrain variations (e.g., 
within codons 67, 70, 215 or 219 of the reverse transcriptase 
gene). The inclusion of a second group is particularly useful 
for analyzing short subsequences of the primary reference 
sequence in which multiple mutations are expected to occur 
within a short distance commensurate with the length of the 
probes (i.e., two or more mutations within 9 to 21 bases). 

The total number of probes on the chips depends on the 
tiling strategy, the length of the reference sequence and the 
options selected with respect to inclusion of multiple probe 
lengths and secondary groups of probes to provide confirmation 
of the existence of common mutations. To read much or all of 
the HIV reverse transcriptase gene (857 bases for the BRU strain), 
chips tiled by the basic strategy typically contain at least 
857 x 4 = 3428 probes. 
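
The probe count follows from placing one set of four probes at each reference position. The sketch below is a hypothetical illustration of such a basic tiling: the function name, the 1-based interrogation position within the probe, and the choice to skip positions where a full-length probe would run off the ends of the reference are assumptions made here (a real design would use flanking sequence so that every position is tiled).

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def basic_tiling(reference, probe_length, interrogation):
    """Yield (position, {probe base: probe}) for each tiled position.

    `interrogation` is the 1-based position of the interrogated base
    within the probe. Each probe is the complement of a reference window,
    written in the same left-to-right order as the reference.
    """
    left = interrogation - 1
    right = probe_length - interrogation
    for pos in range(left, len(reference) - right):
        window = reference[pos - left:pos + right + 1]
        complement = [COMPLEMENT[b] for b in window]
        probes = {}
        for base in "ACGT":
            variant = complement.copy()
            variant[left] = base  # vary only the interrogation position
            probes[base] = "".join(variant)
        yield pos, probes


def probe_count(positions, lanes=4):
    """Number of probes when each position is read by `lanes` probes."""
    return positions * lanes


print(probe_count(857))  # 3428
```

In each set of four, exactly one probe is a perfect complement of the reference window; the other three carry a single mismatch at the interrogation position.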

The target HIV polynucleotide, whose sequence is to be 
determined, is usually isolated from blood samples (peripheral 
blood lymphocytes or PBMC) in the form of RNA. The RNA is 
reverse transcribed to DNA, and the DNA product is then 
amplified. Depending on the selection of primers and 
amplifying enzyme, the amplification product can be RNA or 
DNA. Suitable primers for amplification of target are shown 
in the table below. 

TABLE 4 
AMPLIFICATION OF TARGET 

TARGET SIZE   FORWARD PRIMER                REVERSE PRIMER 
1,742 bp      GTAGAATTCTGTTGACTCAGATTGG     GATAAGCTTGGGCCTTATCTATTCCAT 
535 bp        AAATCCATACAATACTCCAGTATTTGC   ACCCATCCAAAGGAATGGAGGTTCTTTC 
323 bp        Genbank K02013 1889-1908      bases 2211-2192 

AATTAACCCTCACTAAAGGGAga ggaagaatctgttgactcagattggt (RT#1-T3) 
AAITI AATACGACTCACTATAGGGAtncccca ciaacttctgtalgtcattgaca-3' (89-391 T7) 
AATTAACCCTCACTAAAGGGAga agtatactgcattaccauccugta (RTW-T3) 
TaaTacgactcactatagggaga icgacgcaggactcggcngctgaa (HV1-T2) 
AATTAACCCTCACTAAAGGGAGA ccttgtaagtcattggtcttaaaggta (HV2-T3) 


In another aspect of the invention, chips are provided 
for simultaneous detection of HIV and microorganisms that 
commonly parasitize AIDS patients (e.g., cytomegalovirus 
(CMV), Pneumocystis carinii (PCP), fungi (Candida albicans), 
mycobacteria). Non-HIV viral pathogens are detected and their 
drug resistance determined using a similar strategy as for 
HIV. That is, groups of probes are designed to show 
complementarity to a target sequence from a region of the 
genome of a non-HIV viral pathogen known to be associated with 
acquisition of drug resistance. For example, CMV and HSV 
viruses, which frequently co-parasitize AIDS patients, undergo 
mutations to acquire resistance to acyclovir. 

For detection of non-viral pathogens, the chips include 
an array of probes which allow full-sequence determination of 
16S ribosomal RNA or corresponding genomic DNA of the 
pathogens. The additional probes are designed by the same 
principles as described above except that the target sequence 
is a variable region from a 16S RNA (or corresponding DNA) of 
a pathogenic microorganism. Alternatively, the target 
sequence can be a consensus sequence of variable 16S rRNA 
regions from multiple organisms. 16S ribosomal DNA and RNA are 
present in all organisms (except viruses) and the sequence of 
the DNA or RNA is closely related to the evolutionary genetic 
distance between any two species. Hence, organisms which are 
quite close in type (e.g., all mycobacteria) share a common 
region of 16S rDNA, and differ in other regions (variable 
regions) of the 16S rRNA. These differences can be exploited 
to allow identification of the different subtype strains. The 
full sequence of 16S ribosomal RNA or DNA read from the chip 
is compared against a database of the sequence of thousands of 
known pathogens to type unambiguously most nonviral pathogens 
infecting AIDS patients. 

In a further embodiment, the invention provides chips 
which also contain probes for detection of bacterial genes 
conferring antibiotic resistance. An antibiotic resistance 
gene can be detected by hybridization to a single probe 
employed in a reverse dot blot format. Alternatively, a group 
of probes can be designed according to the same principles 
discussed above to read all or part of the DNA sequence encoding 
an antibiotic resistance gene. Analogous probe groups are 
designed for reading other antibiotic resistance gene 
sequences. Antibiotic resistance frequently resides in one of 
the following genes in microorganisms coparasitizing AIDS 
patients: rpoB (encoding RNA polymerase), katG (encoding 
catalase peroxidase), and DNA gyrase A and B genes. 

The inclusion of probes for combinations of tests on a 
single chip simulates the clinical diagnosis tree that a 
physician would follow based on the presentation of a given 
syndrome which could be caused by any number of possible 
pathogens. Such chips allow identification of the presence 
and titer of HIV in a patient, identification of the HIV 
strain type and drug resistance, identification of 
opportunistic pathogens, and identification of the drug 
resistance of such pathogens. Thus, the physician is 
simultaneously apprised of the full spectrum of pathogens 
infecting the patient and the most effective treatments 
therefor. 

Exemplary HIV Chips 
(a) HV 273 

The HV 273 chip contains an array of oligonucleotide 
probes for analysis of an 857 base HIV amplicon between 
nucleotides 2090 and 2946 (HIVBRU strain numbering). The chip 
contains four groups of probes: 11 mers, 13 mers, 15 mers and 
17 mers. From top to bottom, the HV 273 chip is occupied by 
rows of 11 mers, followed by rows of 13 mers, followed by rows 
of 15 mers followed by rows of 17 mers. The interrogation 
position is nucleotide 6, 7, 8 and 9 respectively in the 
different sized probes. This arrangement of the different 
sized probes is referred to as being "in series." Within each 
size group, there are four probe sets laid down in an A-lane, 
a C-lane, a G-lane and a T-lane respectively. Each lane 
contains an overlapping series of probes with one probe for 
each nucleotide in the 2090-2946 HIV reverse transcriptase 
reference sequence (i.e., 857 probes per lane). The lanes 
also include a few column positions which are empty or 
occupied by control probes. These positions serve to orient 
the chip, determine background fluorescence and punctuate 
different subsequences within the target. The chip has an area 
of 1.28 x 1.28 cm, within which the probes form a 130 x 135 
matrix (17,550 cells total). The area occupied by each probe 
(i.e., a probe cell) is about 98 x 95 microns. 
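
The figures in this description can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check under stated assumptions: the function names are invented here, the interrogation position is taken to be the central base of each odd-length probe, and the count of remaining cells is a derived number that the text does not state.

```python
def central_interrogation_position(probe_length):
    """1-based central position of an odd-length probe (6 for an 11 mer)."""
    if probe_length % 2 == 0:
        raise ValueError("expects an odd probe length")
    return (probe_length + 1) // 2


def chip_budget(probe_lengths, reference_length, rows, columns, lanes=4):
    """Compare the number of tiled probes with the number of cells."""
    tiled = len(probe_lengths) * lanes * reference_length
    cells = rows * columns
    return {"tiled_probes": tiled, "cells": cells,
            "remaining_cells": cells - tiled}


lengths = [11, 13, 15, 17]
print([central_interrogation_position(n) for n in lengths])  # [6, 7, 8, 9]
print(chip_budget(lengths, 857, 130, 135))
```

With four probe lengths, four lanes and 857 positions, the tiled probes number 13,712 of the 17,550 cells.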

The chip was tested for its capacity to sequence a 
reverse transcriptase fragment from the HIV strain SF2. An 
831 bp RNA fragment (designated pPoll9) spanning most of the 
HIV reverse transcriptase coding sequence was amplified by 
PCR, using primers tagged with T3 and T7 promoter sequences. 
The primers, designated RT#1-T3 and 89-391 T7, are shown in 
Table 4; see also Gingeras et al., J. Inf. Dis. 164, 1066-1074 
(1991) (incorporated by reference in its entirety for all 
purposes). RNA was labelled by incorporation of fluorescent 
nucleotides. The RNA was fragmented by heating and hybridized 
to the chip for 40 min at 30 degrees. Hybridization signals 
were quantified by fluorescence imaging. 

Taking the best data from the four probe sets at each 
position in the target sequence, 715 out of 821 bases were 
read correctly (87%). (Comparisons are based on the sequence 
of pPoll9 determined by the conventional dideoxy method to be 
identical to SF2.) In general, the longer sized probes 
yielded more sequence than the shorter probes. Of the 21 
positions at which the SF2 and BRU strains diverged within the 
target, 19 were read correctly. 
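
One way to picture "taking the best data" from several probe sets is sketched below. This is an assumed procedure for illustration, not the base-calling method actually used: the function names, the rule of preferring the probe length with the largest ratio of highest to second-highest intensity, and the 1.2 cutoff are all hypothetical, and the intensities are invented.

```python
def discrimination(intensities):
    """Ratio of the highest to the second-highest probe intensity."""
    ranked = sorted(intensities.values(), reverse=True)
    top, runner_up = ranked[0], ranked[1]
    if runner_up <= 0:
        return float("inf") if top > 0 else 0.0
    return top / runner_up


def best_call(intensities_by_length, min_ratio=1.2):
    """Call the probe base from the probe length that discriminates best.

    Returns None (ambiguous) when no probe length reaches `min_ratio`.
    """
    best_base, best_ratio = None, 0.0
    for intensities in intensities_by_length.values():
        ratio = discrimination(intensities)
        if ratio > best_ratio:
            best_ratio = ratio
            best_base = max(intensities, key=intensities.get)
    return best_base if best_ratio >= min_ratio else None


def percent_read(correct, total):
    """Percentage of bases read correctly."""
    return 100.0 * correct / total


# Invented intensities for one position: the 11 mers are ambiguous,
# the 17 mers give a clear call.
position = {
    11: {"A": 100.0, "C": 95.0, "G": 20.0, "T": 10.0},
    17: {"A": 400.0, "C": 100.0, "G": 50.0, "T": 40.0},
}
print(best_call(position))            # A
print(round(percent_read(715, 821)))  # 87
```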

Many of the short ambiguous regions in the target arise 
in segments of the target flanking the points at which the SF2 
and BRU sequences diverge. These ambiguities arise because in 
these regions the comparison of hybridization signals is not 
drawn between perfectly matched and single base mismatch 
probes but between a single-mismatched probe and three probes 
having two mismatches. These ambiguities in reading an SF2 
sequence would not detract from the chip's ability to read a 
BRU sequence either alone or in a mixture with an SF2 target 
sequence. 

In a variation of the above procedure, the chip was 
treated with RNase after hybridization of the pPoll9 target to 
the probes. Addition of RNase digests mismatched target and 
thereby increases the signal to noise ratio. RNase treatment 
increased the number of correctly read bases to 743/821 or 90% 
(combining the data from the four groups of probes). 

In a further variation, the RNA target was replaced with 
a DNA target containing the same segment of the HIV genome. 
The DNA target was prepared by linear amplification using Taq 
polymerase, RT#1-T3 primer, and fluorescein d-UTP label. The 
DNA target was fragmented with uracil DNA glycosylase and heat 
treatment. The hybridization pattern across the array and 
percentage of readable sequence were similar to those obtained 
using an RNA target. However, there were a few regions of 
sequence that could be read from the RNA target that could not 
be read from the DNA target and vice versa. 

(b) HV 407 Chip 

The 407 chip was designed according to the same 
principles as the HV 273 chip, but differs in several 
respects. First, the oligonucleotide probes on this chip are 
designed to exhibit perfect sequence identity (with the 
exception of the interrogation position on each probe) to the 
HIV strain SF2 (rather than the BRU strain as was the case for 
the HV 273 chip). Second, the 407 chip contains 13 mers, 15 
mers, 17 mers and 19 mers (with interrogation positions at 
nucleotide 7, 8, 9 and 10 respectively), rather than the 11 
mers, 13 mers, 15 mers and 17 mers on the HV 273 chip. Third, 
the different sized groups of oligomers are arranged in 
parallel in place of the in-series arrangement on the HV 273 
chip. In the parallel arrangement, the chip contains from top 
to bottom a row of 13 mers, a row of 15 mers, a row of 17 
mers, a row of 19 mers, followed by a further row of 13 mers, 
a row of 15 mers, a row of 17 mers, a row of 19 mers, followed 
by a row of 13 mers, and so forth. Each row contains 4 lanes 
of probes, an A lane, a C lane, a G lane and a T lane, as 
described above. The probes in each lane tile across the 
reference sequence. The layout of probes on the HV 407 chip is 
shown in Fig. 10. 

The 407 chip was separately tested for its ability to 
sequence two targets, pPoll9 RNA and 4MUT18 RNA. pPoll9 
contains an 831 bp fragment from the SF2 reverse transcriptase 
gene which exhibits perfect complementarity to the probes on 
the 407 chip (except of course for the interrogation positions 
in three of the probes in each column). 4MUT18 differs from 
the reference sequence at thirty-one positions within the 
target, including five positions in codons 67, 70, 215 and 219 
associated with acquisition of drug resistance. Target RNA 
was prepared, labelled and fragmented as described above and 
hybridized to the HV 407 chip. The hybridization pattern for 
the pPoll9 target is shown in Fig. 11. 

The sequences read off the chip for the pPoll9 and 4MUT18 
targets are both shown in Fig. 12 (although the two sequences 
were determined in different experiments). The sequence 
labelled wildtype in the Figure is the reference sequence. 
The four lanes of sequence immediately below the reference 
sequence are the respective sequences read from the four sized 
groups of probes for the pPoll9 target (from top-to-bottom, 13 
mers, 15 mers, 17 mers and 19 mers). The next four lanes of 
sequence are the sequences read from the four sized groups of 
probes for the 4MUT18 target (from top-to-bottom in the same 
order). The regions of sequences shown in normal type are 
those that could be read unambiguously from the chip. Regions 
where sequence could not be accurately read are shown 
highlighted. Some regions of sequence that could not be read 
from one sized set of probes could be read from another. 

Taking the best result from the four sized groups of 
probes at each column position, about 97% of bases in the 
pPoll9 sequence and about 90% of bases in the 4MUT18 sequence 
were read accurately. Of the 31 nucleotide differences 
between 4MUT18 and the reference sequence, twenty-seven were 
read correctly including three of the nucleotide changes 
associated with acquisition of drug resistance. Of the 
ambiguous regions in the 4MUT18 sequence determination, most 
occurred in the 4MUT18 segments flanking points of divergence 
between the 4MUT18 and reference sequences. Notably, most of 
the common mutations in HIV reverse transcriptase associated 
with drug resistance (see Table 3) occur at sequence positions 
that can be read from the chip. Thus, most of the commonly 
occurring mutations can be detected by a chip containing an 
array of probes based on a single reference sequence. 

Comparison of the sequences read from the probes of 
different sizes is useful in determining the optimum size 
probe to use for different regions of the target. The 
strategy of customizing probe length within a single group of 
probe sets minimizes the total number of probes required to 
read a particular target sequence. This leaves ample capacity 
for the chip to include probes to other reference sequences 
(e.g., 16S RNA for pathogenic microorganisms) as discussed 
below. 

The HV 407 chip has also been tested for its capacity to 
detect mixtures of different HIV strains. The mixture 
comprises varying proportions of two target sequences; one a 
segment of a reverse transcriptase gene from a wildtype SF2 
strain, the other a corresponding segment from an SF2 strain 
bearing a codon 67 mutation. See Fig. 13. The Figure also 
represents the probes on the chip having an interrogation 
position for reading the nucleotide in which the mutation 
occurs. A single probe in the Figure represents four probes 
on the chip with the symbol (o) indicating the interrogation 
position, which differs in each of the four probes. Figure 14 
shows the fluorescence intensity for the four 13 mers and the 
four 15 mers having an interrogation position for reading the 
nucleotide in the target sequence in which the mutation 
occurs. As the percentage of mutant target is increased, the 
fluorescence intensity of the probe exhibiting perfect 
complementarity to the wildtype target decreases, and the 
intensity of the probe exhibiting perfect complementarity to 
the mutant sequence increases. The intensities of the other 
two probes do not change appreciably. It is concluded that 
the chip can be used to analyze simultaneously a mixture of 
strains, and that a strain comprising as little as ten percent 
of a mixture can be easily detected. 

c. Protease Chip 

A protease chip was constructed using the basic tiling
strategy. The chip comprises four probes tiling across a 382
nucleotide span including 297 nucleotides from the protease
coding sequence. The reference sequence was a consensus clade
B HIV protease sequence. Different probe lengths were
employed for tiling different regions of the reference
sequence. Probe lengths were 11, 14, 17 and 20 nucleotides
with interrogation positions at or adjacent to the center of
each probe. Lengths were optimized from prior hybridization
data employing a chip having multiple tilings, each with a
different probe length.

The chip was hybridized to four different single-stranded
DNA protease target sequences (HXB2, SF2, NY5, pPol4mut18).
Both sense and antisense strands were sequenced. Data from
the chip was compared with that from an ABI sequencer. The
overall accuracy from sequencing the four targets is
illustrated in Table 5 below.

Table 5

                    ABI                Protease Chip
               Sense  Antisense      Sense  Antisense
No call          0        4            9        4
Ambiguous        6       14           17        8
Wrong call       2        3            3        1
TOTAL            8       21           29       13

ABI (sense) - 99.5%
Chip (sense) - 98.1%
ABI (antisense) - 98.6%
Chip (antisense) - 99.1%

Combining the data from sense and antisense strands, both the
chip and the ABI sequencer provided 100% accurate data for all
of the sequence from all four clones.
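
The percentages listed under Table 5 can be reproduced from the
TOTAL row. The following is a minimal sketch, assuming (the text
does not state the denominator) that each figure is computed over
4 targets x 382 nucleotides = 1528 bases per strand; the names
ERRORS, BASES_PER_STRAND and accuracy_percent are illustrative
only.

```python
# Error totals from Table 5 (no call + ambiguous + wrong call).
ERRORS = {
    ("ABI", "sense"): 8,
    ("ABI", "antisense"): 21,
    ("Chip", "sense"): 29,
    ("Chip", "antisense"): 13,
}

# Assumption: four targets, each read over the 382-nucleotide tiled span.
BASES_PER_STRAND = 4 * 382


def accuracy_percent(errors, total_bases=BASES_PER_STRAND):
    """Percentage of bases that were neither missed, ambiguous nor wrong."""
    return round(100.0 * (1 - errors / total_bases), 1)


ACCURACY = {key: accuracy_percent(n) for key, n in ERRORS.items()}

for (method, strand), pct in ACCURACY.items():
    print(f"{method} ({strand}) - {pct}%")
```

Under that assumed denominator the four values agree with the
figures listed above.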

In a further test, the chip was hybridized to protease
target sequences from viral isolates obtained from four
patients before and after ddI treatment. The sequence read
from the chip is shown in Fig. 15. Several mutations
(indicated by arrows) have arisen in the samples obtained
posttreatment. Particularly noteworthy was the chip's
capacity to read a g/a mutation at nucleotide 207,
notwithstanding the presence of two additional mutations (gt)
at adjacent positions.



B. Cystic Fibrosis Chips 

A number of years ago, cystic fibrosis, the most common
severe autosomal recessive disorder in humans, was shown to be
associated with mutations in a gene thereafter named the
Cystic Fibrosis Transmembrane Conductance Regulator (CFTR)
gene. The CFTR gene is about 250 kb in size and has 27 exons.
Wildtype genomic sequence is available for all exonic regions
and exon/intron boundaries (Zielenski et al., Genomics 10,
214-228 (1991)). The full-length wildtype cDNA sequence has
also been described (see Riordan et al., Science 245,
1059-1065 (1989)). Over 400 mutations have been mapped (see
Tsui et al., Hum. Mutat. 1, 197-203 (1992)). Many of the more
common mutations are shown in Table 6. The most common cystic
fibrosis mutation is a three-base deletion resulting in the
omission of amino acid #508 from the CFTR protein. The
frequency of mutations varies widely in populations of
different geographic or ethnic origin (see column 4 of
Table 6). About 90% of all mutations having phenotypic
effects occur in coding regions.

Detection of CFTR mutations is useful in a number of 
respects. For example, screening of populations can identify 
asymptomatic heterozygous individuals. Such individuals are 
at risk of giving rise to affected offspring suffering from CF 
if they reproduce with other such individuals. In utero 
screening of fetuses is also useful in identifying fetuses 
bearing 2 CFTR mutations. Identification of such mutations 
offers the possibility of abortion, or gene therapy. For 
couples known to be at risk of giving rise to affected 
progeny, diagnosis can be combined with in vitro reproduction 
procedures to identify an embryo having at least one wildtype 
CF allele before implantation. Screening children shortly 
after birth is also of value in identifying those having 
2 copies of the defective gene. Early detection allows 
administration of appropriate treatment (e.g., Pulmozyme,
antibiotics, percussive therapy), thereby improving the quality
of life and perhaps prolonging the life expectancy of an 
individual. 

The source of target DNA for detecting CFTR mutations
is usually genomic. In adults, samples can conveniently be 
obtained from blood or mouthwash epithelial cells. In 
fetuses, samples can be obtained by several conventional 
techniques such as amniocentesis, chorionic villus sampling or 
fetal blood sampling. At birth, blood from the amniotic chord 
is a useful tissue source. 

The target DNA is usually amplified by PCR. Some
appropriate pairs of primers for amplifying segments of DNA
including the sites of known mutations are listed in Tables 5
and 6.

OLIGO NUMBER    SEQUENCE

IBl             TCTCCTTGGATATACTTGTGTGAATCAA
788             TCACCAGATTTCGTAGTCTTTTCATA
851             GTCTTGTGTTGAAATTCTCAGGGTAT
769             CTTGTACCAGCTCACTACCTAAT
887             ACCTGAGAAGATAGTAAGCTAGATGAA
888             AACTCCGCCTTTCCAGTTGTAT
934             TTAGTTTCTAGGGGTGGAAGATACA
935             TTAATGACACTGAAGATCACTGTTCTAT
789             CCATTCCAAGATCCCTGATATTTGAA
790             GCACATTTTTGCAAAGTTCATTAGA
891             TCATGGGCCATGTGCTTTTCAA
892             ACCTTCCAGCACTACAAACTAGAA
760             CAAGTGAATCCTGAGCGTGATTT
850             GGTAGTGTGAAGGGTTCATATGCATA
762             GATTACATTAGAAGGAAGATGTGCCTTT
763             ACATGAATGACATTTACAGCAAATGCTT
931             GTGACCATATTGTAATGCATGTAGTGA
932             ATGGTGAACATATTTCTCAAGAGGTAA
955             TGTCTCTGTAAACTGATGGCTAACA
884             TCGTATAGAGTTGATTGGATTGAGAA
885             CCATTAACTTAATGTGGTCTCATCACAA
886             CTACCATAATGCTTGGGAGAAATGAA
782             TCAAAGAATGGCACCAGTGTGAAA
901             TGCTTAGCTAAAGTTAATGAGTTCAT
784             AATTGTGAAATTGTCTGCCATTCTTAA
785             GATTCACTTACTGAACACAGTCTAACAA
791             AnGCTTCTCAGTGATCTGTTG
792
1011            GCCATGGTACCTATATGTCACAGAA
1012            TGCAGAGTAATATGAATTTCTTGAGTACA
766             GGGACTCCAAATATTGCTGTAGTAT
1065            GTACCTGTTGCTCCAGGTATGTT



Other primers can be readily devised from the known
genomic and cDNA sequences of CFTR. The selection of
primers, of course, depends on the areas of the target
sequence that are to be screened. The choice of primers also
depends on the strand to be amplified. For some regions of
the CFTR gene, it makes little difference to the hybridization
signal whether the coding or noncoding strand is used. In
other regions, one strand may give better discrimination in
hybridization signals between matched and mismatched probes
than the other. The upper limit in the length of a segment
that can be amplified from one pair of PCR primers is about 50
kb. Thus, for analysis of mutants through all or much of the
CFTR gene, it is often desirable to amplify several segments
from several paired primers. The different segments may be
amplified sequentially or simultaneously by multiplex PCR.
Frequently, fifteen or more segments of the CFTR gene are
simultaneously amplified by PCR. The primers and
amplification conditions are preferably selected to generate
DNA targets. An asymmetric labelling strategy incorporating
fluorescently labelled dNTPs for random labelling and dUTP for
target fragmentation to an average length of less than 60
bases is preferred. The use of dUTP and fragmentation with
uracil N-glycosylase has the added advantage of eliminating
carry over between samples.

Mutations in the CFTR gene can be detected by any of the
tiling strategies noted above. The block tiling strategy is
one particularly useful approach. In this strategy, a group
(or block) of probes is used to analyze a short segment of
contiguous nucleotides (e.g., 3, 5, 7 or 9) from a CFTR gene
centered around the site of a mutation. The probes in a group
are sometimes referred to as constituting a block because all
probes in the group are usually identical except at their
interrogation positions. As noted above, the probes may also
differ in the presence of leading or trailing sequences
flanking regions of complementarity. However, for ease of
illustration, it will be assumed that such sequences are not
present. As an example, to analyze a segment of five
contiguous nucleotides from the CFTR gene, including the site
of a mutation (such as one of the mutations in Table 6), a
block of probes usually contains at least one wildtype probe
and five sets of mutant probes, each having three probes. The
wildtype probe has five interrogation positions corresponding
to the five nucleotides being analyzed from the reference
sequence. However, the identity of the interrogation
positions is only apparent when the structure of the wildtype
probe is compared with that of the probes in the five mutant
probe sets. The first mutant probe set comprises three
probes, each being identical to the wildtype probe, except in
the first interrogation position, which differs in each of the
three mutant probes and the wildtype probe. The second
through fifth mutant probe sets are similarly composed except
that the differences from the wildtype probe occur in the
second through fifth interrogation position respectively.
Note that in practice, each set of mutant probes is sometimes
laid down on the chip juxtaposed with an associated wildtype
probe. In this situation, a block would comprise five
wildtype probes, each effectively providing the same
information. However, visual inspection and confidence
analysis of the chip is facilitated by the largely redundant
information provided by five wildtype probes.
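
The composition of a block described above can be sketched in a
few lines of Python. This is only an illustration under stated
assumptions: the function name block_probes, the 0-based index
first_pos, and the example 15-mer (taken from the exon 10 probes
discussed later in the text) are hypothetical choices, not part
of the original disclosure.

```python
BASES = "ACGT"


def block_probes(wildtype_probe, first_pos, n_positions=5):
    """Build one wildtype probe plus n_positions mutant probe sets.

    first_pos is the 0-based index of the first interrogation position.
    mutant_sets[k] holds the three probes that are identical to the
    wildtype probe except at interrogation position k.
    """
    mutant_sets = []
    for pos in range(first_pos, first_pos + n_positions):
        ref = wildtype_probe[pos]
        mutant_sets.append([
            wildtype_probe[:pos] + base + wildtype_probe[pos + 1:]
            for base in BASES
            if base != ref
        ])
    return wildtype_probe, mutant_sets


# Hypothetical example probe; five interrogation positions at its center.
wt_probe, mutant_sets = block_probes("TAGTAGAAACCACAA", first_pos=5)
for k, probes in enumerate(mutant_sets, start=1):
    print(f"mutant set {k}: {probes}")
```

A block built this way holds 1 + 5 x 3 = 16 distinct probes; when
each mutant set is laid down next to its own copy of the wildtype
probe, the count on the chip becomes 5 x 4 = 20.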
After hybridization to labelled target, the relative
hybridization signals are read from the probes. Comparison of
the intensities of the three probes in the first mutant probe
set with that of the wildtype probe indicates the identity of
the nucleotide in the target sequence corresponding to the
first interrogation position. Comparison of the intensities
of the three probes in the second mutant probe set with that
of the wildtype probe indicates the identity of the nucleotide
in the target sequence corresponding to the second
interrogation position, and so forth. Collectively, the
relative hybridization intensities indicate the identity of
each of the five contiguous nucleotides in the reference
sequence.
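
The comparison of intensities described above amounts to picking,
at each interrogation position, the probe with the strongest
signal and reporting the complementary target base. The sketch
below illustrates this; the intensity values are made up for the
example, and the min_ratio cut-off used to flag ambiguous
positions is an assumption rather than a rule taken from the
text.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_position(intensity_by_probe_base, min_ratio=1.2):
    """Call the target base at one interrogation position.

    intensity_by_probe_base maps the base at the probe's interrogation
    position to its hybridization signal (wildtype probe plus the three
    mutant probes). Returns "N" when the best probe is not clearly
    brighter than the runner-up.
    """
    ranked = sorted(intensity_by_probe_base.values(), reverse=True)
    best, second = ranked[0], ranked[1]
    if best <= 0 or best < min_ratio * second:
        return "N"
    best_base = max(intensity_by_probe_base, key=intensity_by_probe_base.get)
    return COMPLEMENT[best_base]


def call_block(block_intensities, min_ratio=1.2):
    """Call every interrogation position of a block, in order."""
    return "".join(call_position(p, min_ratio) for p in block_intensities)


# Hypothetical signals for a block with five interrogation positions.
example = [
    {"A": 900, "C": 50, "G": 40, "T": 60},
    {"A": 30, "C": 45, "G": 700, "T": 20},
    {"A": 25, "C": 650, "G": 35, "T": 40},
    {"A": 55, "C": 30, "G": 20, "T": 800},
    {"A": 500, "C": 480, "G": 30, "T": 20},  # too close to call
]
called = call_block(example)
print(called)
```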

In a preferred embodiment, a first group (or block) of 
probes is tiled based on a wildtype reference sequence and a 
second group is tiled based on a mutant version of the wildtype
reference sequence. The mutation can be a point mutation, 
insertion or deletion or any combination of these. The 
combination of first and second groups of probes facilitates 
analysis when multiple target sequences are simultaneously 
applied to the chip, as is the case when a patient being 
diagnosed is heterozygous for the CFTR allele. 

The above strategy is illustrated in Fig. 16, which shows
two groups of probes tiled for a wildtype reference sequence
and a point mutation thereof. The five mutant probe sets for
the wildtype reference sequence are designated wt1-5, and the
five mutant probe sets for the mutant reference sequence are
designated m1-5. The letter N indicates the interrogation
position, which shifts by one position in successive probe
sets from the same group. The figure illustrates the
hybridization pattern obtained when the chip is hybridized
with a homozygous wildtype target sequence comprising
nucleotides n-2 to n+2, where n is the site of a mutation.
For the group of probes tiled based on the reference sequence,
four probes are compared at each interrogation position. At
each position, one of the four probes exhibits a perfect match
with the target, and the other three exhibit a single-base
mismatch. For the group of probes tiled based on the mutant
reference sequence, again four probes are compared at each
interrogation position. At position n, one probe exhibits a
perfect match, and three probes exhibit a single-base
mismatch. Hybridization to a homozygous mutant yields an
analogous pattern, except that the respective hybridization
patterns of probes tiled on the wildtype and mutant reference
sequences are reversed.

The hybridization pattern is very different when the chip
is hybridized with a sample from a patient who is heterozygous
for the mutant allele (see Fig. 17). For the group of probes
tiled based on the wildtype sequence, at all positions but n,
one probe exhibits a perfect match at each interrogation
position, and the other three probes exhibit a one base
mismatch. At position n, two probes exhibit a perfect match
(one for each allele), and the other probes exhibit
single-base mismatches. For the group of probes tiled on the
mutant sequence, the same result is obtained. Thus, the
heterozygote point mutant is easily distinguished from both
the homozygous wildtype and mutant forms by the identity of
hybridization patterns from the two groups of probes.
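
The decision logic at position n can be sketched as follows. This
is an illustration only: it assumes intensities are keyed by the
target base each probe detects, and it treats a probe as a perfect
match when its signal is at least a fixed fraction of the
brightest probe in its group. That threshold, the function names
and the example numbers are all hypothetical.

```python
def bright_bases(intensities, fraction=0.5):
    """Bases whose probe signal is at least `fraction` of the maximum."""
    top = max(intensities.values())
    if top <= 0:
        return set()
    return {base for base, value in intensities.items() if value >= fraction * top}


def genotype(wt_group, mut_group, wt_base, mut_base, fraction=0.5):
    """Classify the sample from the position-n probes of both groups.

    wt_group / mut_group: signals of the four probes at position n in the
    groups tiled on the wildtype and mutant reference, respectively.
    """
    calls = [bright_bases(group, fraction) for group in (wt_group, mut_group)]
    if all(c == {wt_base} for c in calls):
        return "homozygous wildtype"
    if all(c == {mut_base} for c in calls):
        return "homozygous mutant"
    if all(c == {wt_base, mut_base} for c in calls):
        return "heterozygous"
    return "indeterminate"


# Hypothetical signals for a G -> T point mutation at position n.
het = genotype(
    {"A": 20, "C": 30, "G": 600, "T": 550},
    {"A": 25, "C": 15, "G": 580, "T": 610},
    wt_base="G", mut_base="T",
)
print(het)
```

Requiring both groups to agree mirrors the statement above that
the heterozygote is recognized by the identity of the patterns
from the two groups.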

Typically, a chip comprises several paired groups of
probes, each pair for detecting a particular mutation. For
example, some chips contain 5, 10, 20, 40 or 100 paired groups
of probes for detecting the corresponding numbers of
mutations. Some chips are customized to include paired groups
of probes for detecting all mutations common in particular
populations (see Table 6). Chips usually also contain control
probes for verifying that correct amplification has occurred
and that the target is properly labelled.

The goal of the tiling strategy described above is to
focus on short regions of the CFTR gene flanking the sites
of known mutation. Other tiling strategies analyze much
larger regions of the CFTR gene, and are appropriate for
locating and identifying hitherto uncharacterized mutations.
For example, the entire genomic CFTR gene (250 kb) can be
tiled by the basic tiling strategy from an array of about one
million probes. Synthesis and scanning of such an array of
probes is entirely feasible. Other tiling strategies, such as
the block tiling, multiplex tiling or pooling can cover the
entire gene with fewer probes. Some tiling strategies analyze
some or all components of the CFTR gene, such as the cDNA
coding sequence or individual exons. Analysis of exons 10 and
11 is particularly informative because these are the locations
of many common mutations including the ΔF508 mutation.
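
The figure of about one million probes follows from the basic
tiling strategy using four probes per interrogated base. A short
arithmetic sketch (the function name is illustrative):

```python
def basic_tiling_probe_count(n_bases, probes_per_position=4):
    """Probes needed when every base is interrogated by four probes."""
    return n_bases * probes_per_position


cftr_probes = basic_tiling_probe_count(250_000)  # 250 kb genomic CFTR gene
print(cftr_probes)
```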
Exemplary CFTR chips 

One illustrative chip bears an array of 1296 probes
covering the full length of exon 10 of the CFTR gene arranged
in a 36 x 36 array of 356 μm elements. The probes in the
array can have any length, preferably in the range of from 10
to 18 residues and can be used to detect and sequence any
single-base substitution and any deletion within the 192-base
exon, including the three-base deletion known as ΔF508. As
described in detail below, hybridization of nanomolar
concentrations of wild-type and ΔF508 oligonucleotide target
nucleic acids labeled with fluorescein to these arrays
produces highly specific signals (detected with confocal
scanning fluorescence microscopy) that permit discrimination
between mutant and wild-type target sequences in both
homozygous and heterozygous cases.

Sets of probes of a selected length in the range of from
10 to 18 bases and complementary to subsequences of the known
wild-type CFTR sequence are synthesized starting at a position
a few bases into the intron on the 5'-side of exon 10 and
ending a few bases into the intron on the 3'-side. There is a
probe for each possible subsequence of the given segment of
the gene, and the probes are organized into a "lane" in such a
way that traversing the lane from the upper left-hand corner
of the chip to the lower right-hand corner corresponds to
traversing the gene segment base-by-base from the 5'-end. The
lane containing that set of probes is, as noted above, called
the "wild-type lane."

Relative to the wild-type lane, a "substitution" lane,
called the "A-lane", was synthesized on the chip. The A-lane
probes were identical in sequence to an adjacent (immediately
below the corresponding) wild-type probe but contained,
regardless of the sequence of the wild-type probe, a dA
residue at position 7 (counting from the 3'-end). In similar
fashion, substitution lanes with replacement bases dC, dG, and
dT were placed onto the chip in a "C-lane," a "G-lane," and a
"T-lane," respectively. A sixth lane on the chip consisted of
probes identical to those in the wild-type lane but for the
deletion of the base in position 7 and restoration of the
original probe length by addition to the 5'-end of the base
complementary to the gene at that position.

The four substitution lanes enable one to deduce the
sequence of a target exon 10 nucleic acid from the relative
intensities with which the target hybridizes to the probes in
the various lanes. Various versions of such exon 10 DNA chips
were made as described above with probes 15 bases long, as
well as chips with probes 10, 14, and 18 bases long. For the
results described below, the probes were 15 bases long, and
the position of substitution was 7 from the 3'-end.

The sequences of several important probes are shown
below. In each case, the letter "X" stands for the
interrogation position in a given column set, so each of the
sequences actually represents four probes, with A, C, G, and
T, respectively, taking the place of the "X." Sets of shorter
probes derived from the sets shown below by removing up to
five bases from the 5'-end of each probe and sets of longer
probes made from this set by adding up to three bases from the
exon 10 sequence to the 5'-end of each probe, are also useful
and provided by the invention.

3'-TTTATAXTAGAAACC
3'-TTATAGXAGAAACCA
3'-TATAGTXGAAACCAC
3'-ATAGTAXAAACCACA
3'-TAGTAGXAACCACAA
3'-AGTAGAXACCACAAA
3'-GTAGAAXCCACAAAG
3'-TAGAAAXCACAAAGG
3'-AGAAACXACAAAGGA
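
The probe sets listed above can be derived mechanically from a
target sequence. The following sketch builds, for every 15-base
window, the wild-type probe, the four substitution-lane probes
and a deletion-lane probe, using as input the 39-mer wild-type
target given later in the text (exon positions 111-149). The
function names are illustrative, and the handling of the deletion
lane reflects one reading of the description (drop position 7 and
extend the 5'-end by the next complementary base).

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Wild-type synthetic target from the text, written 5'->3'.
WT508 = "CATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGA"


def complement_3to5(target_5to3):
    """Complement of the target, written 3'->5' so it aligns with the target."""
    return "".join(COMPLEMENT[b] for b in target_5to3)


def column_sets(target_5to3, probe_len=15, sub_pos=7):
    """Return one column set per probe_len window of the complement.

    sub_pos counts from the 3'-end of the probe (1-based), as in the text.
    """
    comp = complement_3to5(target_5to3)
    i = sub_pos - 1
    sets = []
    for start in range(len(comp) - probe_len + 1):
        wt = comp[start:start + probe_len]
        lanes = {base: wt[:i] + base + wt[i + 1:] for base in "ACGT"}
        deletion = None
        if start + probe_len < len(comp):  # one extra base needed at the 5'-end
            deletion = wt[:i] + comp[start + i + 1:start + probe_len + 1]
        sets.append({
            "pattern": wt[:i] + "X" + wt[i + 1:],
            "wild_type": wt,
            "lanes": lanes,
            "deletion": deletion,
        })
    return sets


sets = column_sets(WT508)
for s in sets[9:18]:
    print("3'-" + s["pattern"])
```

With this input, the nine windows printed by the loop correspond
to the nine probe patterns listed above.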



To demonstrate the ability of the chip to distinguish the
ΔF508 mutation from the wild-type, two synthetic target
nucleic acids were made. The first, a 39-mer complementary to
a subsequence of exon 10 of the CFTR gene having the three
bases involved in the ΔF508 mutation near its center, is
called the "wild-type" or wt508 target, corresponds to
positions 111-149 of the exon, and has the sequence shown
below:

5'-CATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGA.

The second, a 36-mer probe derived from the wild-type target
by removing those same three bases, is called the "mutant"
target or mu508 target and has the sequence shown below, first
with dashes to indicate the deleted bases, and then without
dashes but with one base underlined (to indicate the base
detected by the T-lane probe, as discussed below):

5'-CATTAAAGAAAATATCAT---TGGTGTTTCCTATGATGA;
5'-CATTAAAGAAAATATCATTGGTGTTTCCTATGATGA.

Both targets were labeled with fluorescein at the 5'-end.

In three separate experiments, the wild-type target, the
mutant target, and an equimolar mixture of both targets were
exposed (0.1 nM wt508, 0.1 nM mu508, and 0.1 nM wt508 plus 0.1
nM mu508, respectively, in a solution compatible with nucleic
acid hybridization) to a CF chip. The hybridization mixture
was incubated overnight at room temperature, and then the chip
was scanned on a reader (a confocal fluorescence microscope in
photon-counting mode; images of the chip were constructed
from the photon counts) at several successively higher
temperatures while still in contact with the target solution.
After each temperature change, the chip was allowed to
equilibrate for approximately one-half hour before being
scanned. After each set of scans, the chip was exposed to
denaturing solvent and conditions to wash the chip, i.e., to
remove target that had bound, so that the next experiment
could be done with a clean chip.

The results of the experiments are shown in Figures 18,
19, 20, and 21. Figure 18, in panels A, B, and C, shows an
image made from the region of a DNA chip containing CFTR exon
10 probes; in panel A, the chip was hybridized to a wild-type
target; in panel C, the chip was hybridized to a mutant ΔF508
target; and in panel B, the chip was hybridized to a mixture
of the wild-type and mutant targets. Figure 19, in sheets
1-3, corresponding to panels A, B, and C of Figure 18, shows
graphs of fluorescence intensity versus tiling position. The
labels on the horizontal axis show the bases in the wild-type
sequence corresponding to the position of substitution in the
respective probes. Plotted are the intensities observed from
the features (or synthesis sites) containing wild-type probes,
the features containing the substitution probes that bound the
most target ("called"), and the feature containing the
substitution probes that bound the target with the second
highest intensity of all the substitution probes ("2nd
Highest").

These figures show that, for the wild-type target and the
equimolar mixture of targets, the substitution probe with a
nucleotide sequence identical to the corresponding wild-type
probe bound the most target, allowing for an unambiguous
assignment of target sequence as shown by letters near the
points on the curve. The target wt508 thus hybridized to the
probes in the wild-type lane of the chip, although the
strength of the hybridization varied from probe-to-probe,
probably due to differences in melting temperature. The
sequence of most of the target can thus be read directly from
the chip, by inference from the pattern of hybridization in
the lanes of substitution probes (if the target hybridizes
most intensely to the probe in the A-lane, then one infers
that the target has a T in the position of substitution, and
so on).

For the mutant target, the sequence could similarly be
called on the 3'-side of the deletion. However, the intensity
of binding declined precipitously as the point of substitution
approached the site of the deletion from the 3'-end of the
target, so that the binding intensity on the wild-type probe
whose point of substitution corresponds to the T at the 3'-end
of the deletion was very close to background. Following that
pattern, the wild-type probe whose point of substitution
corresponds to the middle base (also a T) of the deletion
bound still less target. However, the probe in the T-lane of
that column set bound the target very well. Examination of
the sequences of the two targets reveals that the deletion
places an A at that position when the sequences are aligned at
their 3'-ends and that the T-lane probe is complementary to
the mutant target with but two mismatches near an end (shown
below in lower-case letters, with the position of substitution
underlined):

Target: 5'-CATTAAAGAAAATATCATTGGTGTTTCCTATGATGA
Probe:            3'-TagTAGTAACCACAA

Thus the T-lane probe in that column set calls the correct
base from the mutant sequence. Note that, in the graph for
the equimolar mixture of the two targets, that T-lane probe
binds almost as much target as does the A-lane probe in the
same column set, whereas in the other column sets, the probes
that do not have wild-type sequence do not bind target at all
as well. Thus, that one column set, and in particular the
T-lane probe within that set, detects the ΔF508 mutation under
conditions that simulate the homozygous case and also
conditions that simulate the heterozygous case.
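
The alignment of the T-lane probe against the mutant target can
be checked with a short script. The sketch below slides the probe
(written 3'->5') along the complement of the target and reports
the placement with the fewest mismatches; the function names are
illustrative, and counting mismatches is only a crude stand-in
for hybridization strength.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

WT508 = "CATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGA"  # wild-type target, 5'->3'
MU508 = "CATTAAAGAAAATATCATTGGTGTTTCCTATGATGA"     # mutant target, 5'->3'
T_LANE = "TAGTAGTAACCACAA"                          # T-lane probe, 3'->5'


def best_alignment(target_5to3, probe_3to5):
    """Return (offset, mismatch_indices) of the best ungapped placement."""
    comp = "".join(COMPLEMENT[b] for b in target_5to3)
    best = None
    for start in range(len(comp) - len(probe_3to5) + 1):
        window = comp[start:start + len(probe_3to5)]
        mism = [i for i, (p, w) in enumerate(zip(probe_3to5, window)) if p != w]
        if best is None or len(mism) < len(best[1]):
            best = (start, mism)
    return best


def mark_mismatches(probe, mismatches):
    """Lower-case the mismatched probe bases, as in the text."""
    return "".join(b.lower() if i in mismatches else b for i, b in enumerate(probe))


mu_offset, mu_mism = best_alignment(MU508, T_LANE)
wt_offset, wt_mism = best_alignment(WT508, T_LANE)
print(mark_mismatches(T_LANE, mu_mism), "offset", mu_offset)
print(mark_mismatches(T_LANE, wt_mism), "offset", wt_offset)
```

Against the mutant target the probe shows two mismatches next to
its 3'-end, whereas against the wild-type target its single
mismatch falls at the central position of substitution.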

Although in this example the sequence could not be
reliably deduced near the ends of the target, where there is
not enough overlap between target and probe to allow effective
hybridization, and around the center of the target, where
hybridization was weak for some other reason, perhaps high
AT-content, the results show the method and the probes of the
invention can be used to detect the mutation of interest. The
mutant target gave a pattern of hybridization that was very
similar to that of the wt508 target at the ends, where the two
share a common sequence, and very different in the middle,
where the deletion is located. As one scans the image from
right to left, the intensity of hybridization of the target to
the probes in the wild-type lane drops off much more rapidly
near the center of the image for mu508 than for wt508; in
addition, there is one probe in the T-lane that hybridizes
intensely with mu508 and hardly at all with wt508. The
results from the equimolar mixture of the two targets, which
represents the case one would encounter in testing a
heterozygous individual for the mutation, are a blend of the
results for the separate targets, showing the power of the
invention to distinguish a wild-type target sequence from one
containing the ΔF508 mutation and to detect a mixture of the
two sequences.

The results above clearly demonstrate how the DNA chips
of the invention can be used to detect a deletion mutation,
ΔF508; another model system was used to show that the chips
can also be used to detect a point mutation. One
mutation in the CFTR gene is G480C, which involves the
replacement of the G in position 46 of exon 10 by a T,
resulting in the substitution of a cysteine for the glycine
normally in position #480 of the CFTR protein. The model
target sequences included the 21-mer probe wt480 to represent
the wild-type sequence at positions 37-55 of exon 10:
5'-CCTTCAGAGGGTAAAATTAAG and the 21-mer probe mu480 to
represent the mutant sequence:
5'-CCTTCAGAGTGTAAAATTAAG.

In separate experiments, a DNA chip was hybridized to 
each of the targets wt480 and mu480, respectively, and then 
scanned with a confocal microscope. Figure 20, in panels A, 
B, and C, shows an image made from the region of a DNA chip 
containing CFTR exon 10 probes; in panel A, the chip was 
hybridized to the wt480 target; in panel C, the chip was 
hybridized to the mu480 target; and in panel B, the chip was 
hybridized to a mixture of the wild-type and mutant targets. 
Figure 21, in sheets 1-3, corresponding to panels A, B, and 
C of Figure 20, shows graphs of fluorescence intensity versus 
tiling position. The labels on the horizontal axis show the 
bases in the wild-type sequence corresponding to the position 
of substitution in the respective probes. Plotted are the 
intensities observed from the features (or synthesis sites) 
containing wild-type probes, the features containing the 
substitution probes that bound the most target ("called"), and 
the feature containing the substitution probes that bound the 
target with the second highest intensity of all the 
substitution probes ("2nd Highest"). 

These figures show that the chip could be used to
sequence a 16-base stretch from the center of the target wt480
and that discrimination against mismatches is quite good
throughout the sequenced region. When the DNA chip was
exposed to the target mu480, only one probe in the portion of
the chip shown bound the target well: the probe in the set of
probes devoted to identifying the base at position 46 in exon
10 and that has an A in the position of substitution and so is
fully complementary to the central portion of the mutant
target. All other probes in that region of the chip have at
least one mismatch with the mutant target and therefore bind
much less of it. In spite of that fact, the sequence of mu480
for several positions to both sides of the mutation can be
read from the chip, albeit with much-reduced intensities from
those observed with the wild-type target.
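
The relationship between the two model targets and the probe
that detects the mutation can be verified directly from the
sequences given above. In this sketch (function and variable
names are illustrative), the two 21-mers are compared base by
base and the complement of the mutant base is reported as the
base expected at the probe's position of substitution.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

WT480 = "CCTTCAGAGGGTAAAATTAAG"  # wild-type model target, 5'->3'
MU480 = "CCTTCAGAGTGTAAAATTAAG"  # G480C model target, 5'->3'


def point_differences(ref, alt):
    """List (0-based index, ref base, alt base) wherever two sequences differ."""
    if len(ref) != len(alt):
        raise ValueError("sequences must have equal length")
    return [(i, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


diffs = point_differences(WT480, MU480)
index, wt_base, mu_base = diffs[0]
probe_base = COMPLEMENT[mu_base]
print(f"target position {index + 1}: {wt_base} -> {mu_base}; "
      f"fully complementary probe carries {probe_base}")
```

The single G to T difference means the fully complementary probe
carries an A at its position of substitution, in agreement with
the description above.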

The results also show that, when the two targets were
mixed together and exposed to the chip, the hybridization
pattern observed was a combination of the other two patterns.
The wild-type sequence could easily be read from the chip, but
the probe that bound the mu480 target so well when only the
mu480 target was present also bound it well when both the
mutant and wild-type targets were present in a mixture, making
the hybridization pattern easily distinguishable from that of
the wild-type target alone. These results again show the
power of the DNA chips of the invention to detect point
mutations in both homo- and heterozygous individuals.

To demonstrate clinical application of the DNA chips of
the invention, the chips were used to study and detect
mutations in nucleic acids from genomic samples. Genomic
samples from an individual carrying only the wild-type gene and
an individual heterozygous for ΔF508 were amplified by PCR
using exon 10 primers containing the promoter for T7 RNA
polymerase. Illustrative primers of the invention are shown
below.

Exon  Name       Sequence

10    CFi9-T7    TAATACGACTCACTATAGGGAGatgacctaataatgatgggttt
10    CFi10c-T7  TAATACGACTCACTATAGGGAGtagtgtgaagggttcatatgc
10    CFi10c-T3  CTCGGAATTAACCCTCACTAAAGGtagtgtgaagggttcatatgc
11    CFi10-T7   TAATACGACTCACTATAGGGAGagcatactaaaagtgactctc
11    CFi11c-T7  TAATACGACTCACTATAGGGAGacatgaatgacatttacagcaa
11    CFi11c-T3  CGGAATTAACCCTCACTAAAGGacatgaatgacatttacagcaa

These primers can be used to amplify exon 10 or exon 11
sequences; in another embodiment, multiplex PCR is employed,
using two or more pairs of primers to amplify more than one
exon at a time.

The product of amplification was then used as a template
for the RNA polymerase, with fluoresceinated UTP present to
label the RNA product. After sufficient RNA was made, it was
fragmented and applied to an exon 10 DNA chip for 15 minutes,
after which the chip was washed with hybridization buffer and
scanned with the fluorescence microscope. A useful positive
control included on many CF exon 10 chips is the 8-mer
3'-CGCCGCCG-5'. Figure 22, in panels A and B, shows an image
made from a region of a DNA chip containing CFTR exon 10
probes; in panel A, the chip was hybridized to nucleic acid
derived from the genomic DNA of an individual with wild-type
ΔF508 sequences; in panel B, the target nucleic acid
originated from a heterozygous (with respect to the ΔF508
mutation) individual. Figure 23, in sheets 1 and 2,
corresponding to panels A and B of Figure 22, shows graphs of
fluorescence intensity versus tiling position.

These figures show that the sequence of the wild-type RNA
can be called for most of the bases near the mutation. In the
case of the ΔF508 heterozygous carrier, one particular probe
in the T-lane, the same one that distinguished so clearly
between the wild-type and mutant oligonucleotide targets in the
model system described above, binds a large amount of
RNA, while the same probe binds little RNA from the wild-type
individual. These results show that the DNA chips of the
invention are capable of detecting the ΔF508 mutation in a
heterozygous carrier.

Further chips were constructed using the block tiling
strategy to provide an array of probes for analyzing a CFTR
mutation. The array comprised 93 µm x 96 µm features arranged
into eleven columns and four rows (44 total probes). Probes
in five of these columns were from four probe sets tiled based
on the wildtype CFTR sequence and having interrogation
positions corresponding to the site of a mutation and two
bases on either side. Five of the remaining columns contained
four sets of probes tiled based on the mutant version of the
CFTR sequence. These probe sets also had interrogation
positions corresponding to the site of mutation and two
nucleotides on either side. The eleventh column contained
four cells for control probes.
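
The eleven-column layout described above can be sketched in
Python as follows. This is an illustration under stated
assumptions, not the actual chip design: the probe length (15),
the location of the interrogation position within the probe
(index 7), the toy sequences and the function names are all
hypothetical, and probes are written in the sense of the sequence
they match rather than as the complementary strand.

```python
BASES = "ACGT"


def probe_column(reference, position, probe_len=15, offset=7):
    """Four probes covering `position`, differing only at the interrogation site."""
    start = position - offset
    window = reference[start:start + probe_len]
    return [window[:offset] + base + window[offset + 1:] for base in BASES]


def block_layout(wild_type, mutant, site, flank=2):
    """Columns for one mutation: 5 wild-type, 5 mutant, 1 control."""
    positions = range(site - flank, site + flank + 1)
    columns = [probe_column(wild_type, p) for p in positions]
    columns += [probe_column(mutant, p) for p in positions]
    columns.append(["control"] * len(BASES))  # placeholder control cells
    return columns


# Toy sequences, not CFTR: the mutant differs at index 12 (A -> G).
wild_type = "GATTACAGGCTTAACCGGTATCGATC"
mutant = wild_type[:12] + "G" + wild_type[13:]

layout = block_layout(wild_type, mutant, site=12)
print(len(layout), "columns,", sum(len(c) for c in layout), "features")
```

With two flanking bases on either side of the mutation site, each
reference contributes five columns of four probes, which together
with the control column gives the 44 features stated above.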

Fluorescently labeled hybridization targets were prepared
by PCR amplification. 100 µg of genomic DNA, 0.4 µM of each
primer, 50 each dATP, dCTP, dGTP and dUTP (Pharmacia) in
10 mM Tris-Cl, pH 8.3, 50 mM KCl, 2.5 mM MgCl2 and 2 U Taq
polymerase (Perkin-Elmer) were cycled 36 times using a
Perkin-Elmer 9600 thermocycler and the following times and
temperatures: 95°C, 10 sec; 55°C, 10 sec; 72°C, 30 sec. 10
µl of this reaction product was used as a template in a
second, asymmetric PCR reaction. Conditions included 1 µM
asymmetric PCR primer, 50 mM each dATP, dCTP, TTP, 25 µM
fluorescein-dGTP (DuPont), 10 mM Tris-Cl, pH 9.1, 75 mM KCl,
3.5 mM MgCl2. The reaction was cycled 5 times with the following
conditions: 95°C, 10 sec; 60°C, 10 sec; 55°C, 1 min; and 72°C,
1.5 min. This was immediately followed by another 20 cycles
using the following conditions: 95°C, 10 sec; 60°C, 10 sec;
72°C, 1.5 min.

Amplification products were fragmented by treating
with 2 U of Uracil-N-glycosylase (Gibco) at 30°C for 30 min,
followed by heat denaturation at 95°C for 5 min. Finally, the
labeled, fragmented PCR product was diluted into hybridization
buffer made up of 5X SSPE and 1 mM cetyltrimethylammonium
bromide (CTAB). The dilution factor ranged from 10x to 25x
with 40 µl of sample being diluted into 0.4 ml to 1 ml of
hybridization solution.
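
The stated dilution range follows directly from the volumes
given. A short Python check, in which the function name
dilution_factor is hypothetical and the final volume is taken to
be the stated volume of hybridization solution:

```python
def dilution_factor(sample_ul: float, final_volume_ul: float) -> float:
    """Fold dilution when `sample_ul` is brought to `final_volume_ul`."""
    return final_volume_ul / sample_ul


print(dilution_factor(40, 400))   # 0.4 ml of hybridization solution
print(dilution_factor(40, 1000))  # 1 ml of hybridization solution
```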

Target hybridization was generally carried out with
the chip shaking in a small dish containing 500 µl to 1 ml
total volume of hybridization solution. All hybridizations
were done at 30°C constant temperature. Alternatively, some
hybridizations were carried out with chips enclosed in a
plastic package with the 1 cm x 1 cm chip glued facing a 250
µl fluid chamber. 250-350 µl of hybridization solution was
introduced and mixed using a syringe pump. Temperature was
controlled by interfacing the back surface of the package with
a Peltier heating/cooling device. Following hybridization,
chips were washed with 5X SSPE, 0.1% Triton X-100 at 25°C-30°C
prior to fluorescent image generation.

Hybridized, washed DNA chips were scanned for
fluorescence using a stage-scanning confocal epifluorescent
microscope and 488 nm argon ion laser excitation. Emitted
light was collected through a band pass filter centered at
530 nm. The resulting fluorescence image was spatially
reconstructed and intensity data were then analyzed. Features
with the peak fluorescence intensity in each column were
identified and compared with any signal intensity at the
remaining single base mismatch probe sites in the same column.
The sequences of the highest intensity features were then
compared across all ten columns of each sub-array to determine
whether peak intensity scores for the wild type sequence and
the mutant sequence were similar or significantly different.
These results were used to generate the genotype call of wild
type (high intensity signals only in wild type probe columns),
mutant (high intensity signals only in the mutant probe
columns) or heterozygous (high intensity signals in both the
wild type and mutant probe columns).
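
The calling logic just described can be sketched in Python. This
is a simplified illustration, not the analysis software actually
used: the discrimination measure (peak intensity divided by the
mean of the other features in the same column), the three-fold
threshold and all function names are assumptions made for the
example.

```python
from statistics import mean


def discrimination(column):
    """Peak intensity in a column divided by the mean of its other features."""
    ordered = sorted(column, reverse=True)
    rest = mean(ordered[1:])
    return ordered[0] / rest if rest > 0 else float("inf")


def block_is_high(columns, fold):
    """True when the columns of a probe block show a clear peak on average."""
    return mean(discrimination(col) for col in columns) >= fold


def call_genotype(wt_columns, mut_columns, fold=3.0):
    """Genotype call from the wild type and mutant probe columns."""
    wt_high = block_is_high(wt_columns, fold)
    mut_high = block_is_high(mut_columns, fold)
    if wt_high and mut_high:
        return "heterozygous"
    if wt_high:
        return "wild type"
    if mut_high:
        return "mutant"
    return "no call"


# Toy intensities: five columns of four features per block.
signal = [[600, 100, 90, 110]] * 5   # one clearly dominant feature
noise = [[12, 10, 11, 9]] * 5        # no dominant feature

print(call_genotype(signal, noise))
print(call_genotype(signal, signal))
```

A ratio within each column is used here so that the call does not
depend on the absolute brightness of a given scan; a production
analysis would also need to handle cross-hybridizing features such
as those discussed for closely spaced mutations.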

Figure 24 (panel A) shows an image of the fluorescence
signals in arrays designed to detect the G551D (G>A) and
Q552X (C>T) CFTR mutations. The hybridization target is an
exon 11 amplicon generated from wild type genomic DNA. Wild
type hybridization patterns are evident at both locations. No
significant fluorescence signal resulted at any of the
features with probes complementary to mutant or mismatched
sequences. Relative fluorescence intensities were six-fold
brighter for the perfectly matched wild type features compared
with the background signal intensity at mutant and mismatch
features. In addition, the sequence at these loci can be
confirmed as AGGTC and GTCAA, respectively, where the bold
type face indicates the mutation sites. Figure 24 (panel B)
shows the same probe array features after hybridization with a
fluorescent target generated from DNA heterozygous for the
G551D mutation. Both the wild type and mutant probe columns
have features with significant fluorescence intensity,
indicating the hybridization of both wild type and mutant CFTR
alleles at this site. Only wild type probes hybridized with
any significant fluorescence signal in the Q552X subarray,
indicating a wild type target sequence. However, an
additional feature that did not hybridize in the first
experiment shows significant fluorescence intensity in this
experiment. Because the G551D and Q552X mutations are only
two bases apart, the probe sequence in the additional
feature has a perfectly matched 12-mer overlap with the mutant
G551D target.

Figure 25 (panels A and B) illustrates mutation
analysis for ΔF508, a three base pair deletion in exon 10 of
the CFTR gene. In contrast to the hybridization pattern seen
in base change mutations, in mutations where bases are
inserted or deleted, probe arrays show a different
hybridization pattern. Identical probes are synthesized in
the two central columns of base substitution arrays. As a
result, either mutant or wild type target hybridizations
always result in two side-by-side features (a doublet) with
high fluorescence intensity at the center of the array. In a
heterozygote hybridization, two sets of doublets, one matched
to the wild type sequence and one to the mutant sequence, occur
(Figure 24, panel B). In contrast, wild type and mutant probe
column sequences are offset from each other for deletion or
insertion mutations and hybridization doublets are not seen.
Instead of the six high intensity signals with one doublet,
five independent features in alternating columns characterize
a homozygote, and ten features, one in each column, will be
positive with heterozygote targets. This is evident from the
ΔF508 hybridization pattern in Figure 25, panel A. Although a
wild type target has been hybridized and the highest intensity
features confirm the wild type sequence (ATCTT), there is an
additional hybridization in the first mutant column. Analysis
of that probe sequence shows a 10 base perfect match with the
mutant sequence.

The image in Figure 25, panel B resulted from
hybridizing a DNA chip with a target homozygous for ΔF508. In
this image five features, all with probe sequences
complementary to the mutant, show significant signal. The
mutation sequence bridging the deletion site, ATTGG, is
confirmed. Similar to what was seen in the example of the
G551D mutation, there is added information in neighboring
subarrays designed to detect the ΔI507 and F508C mutations.
This is expected since they are in such close proximity to
ΔF508 that their probe sets significantly overlap the ΔF508
probes. The ΔF508 homozygous target has no perfect matches
with wild type or mutant probes in the ΔI507 and F508C
subarrays. However, there are some low intensity signals
within these two blocks of probes. The F508C array has a
doublet that matches 11 bases of the mutant ΔF508 target.
Similarly, the hybridization in the eighth column of the ΔI507
array has a probe that matches 13/14 bases with the target.

Figure 26 shows hybridization of a heterozygous double
mutant ΔF508/F508C to the same array as described above.
Conventional reverse dot blot would score this sample as a
homozygous ΔF508 mutant. In the present assays, the ΔF508 and
F508C alleles are separately detected by the respective
subarrays designed to detect these mutations.

C. Chips for Cancer Diagnosis 

There are at least two types of genes which are often
altered in cancerous cells. The first type of gene is an
oncogene such as a mismatch-repair gene, and the second type
of gene is a tumor suppressor gene such as a transcription
factor. Examples of such mismatch-repair genes include
hMSH2 (Fishel et al., Cell 75, 1027-1038 (1993)) and hMLH1
(Papadopoulos et al., Science 263, 1625-1628 (1994)). The
most well-known example of a tumor suppressor gene is the p53
protein gene (Buchman et al., Gene 70, 245-252 (1988)). By
monitoring the state of both oncogenes and tumor suppressor
genes (individually and in combination) in a patient, it is
possible to determine individual susceptibility to a cancer, to
assess a patient's prognosis upon cancer diagnosis, and to
target therapy more efficiently.

The p53 gene spans 20 kbp in humans and has 11 exons, 10
of which are protein coding (see Tominaga et al., 1992,
Critical Reviews in Oncogenesis 3:257-282, incorporated herein
by reference). The gene produces a 53 kilodalton
phosphoprotein that regulates DNA replication. The protein
acts to halt replication at the G1/S boundary in the cell
cycle and is believed to act as a "molecular policeman,"
shutting down replication when the DNA is damaged or blocking
the reproduction of DNA viruses (see Lane, 1992, Nature
358:15-16, incorporated herein by reference). The p53
transcription factor is part of a fundamental pathway which
controls cell growth. Wild-type p53 can halt cell growth, or
in some cases bring about programmed cell death (apoptosis).
Such tumor-suppressive effects are absent in a variety of
known p53 gene mutations. Moreover, p53 mutants not only
deprive a cell of wild-type p53 tumor suppression, they also
may spur abnormal cell growth.

In tumor cells, p53 is the most commonly mutated gene
discovered to date (see Levine et al., 1991, Nature
351:453-456, and Hollstein et al., 1991, Science 253:49-53,
each of which is incorporated herein by reference). Over half
of the 6.5 million patients diagnosed with cancer annually
possess p53 mutations in their tumor cells. Among common
tumors, about 70% of colorectal cancers, 50% of lung cancers
and 40% of breast cancers contain p53 mutations. In all, over
51 types of human tumors have been documented to possess p53
mutations, including bladder, brain, breast, cervix, colon,
esophagus, larynx, liver, lung, ovary, pancreas, prostate,
skin, stomach, and thyroid tumors (Culotta & Koshland, Science
262, 1958-1961 (1993); Rodrigues et al., 1990, PNAS
87:7555-7559, incorporated herein by reference). According to
data presented by David Sidransky (1992 San Diego Conference),
over 400 mutations in p53 are known. The presence of a p53
mutation in a tumor has also been correlated with a patient's
prognosis. Patients who possess p53 mutations have a lower
5-year survival rate.

Proper diagnosis of the form of p53 in tumor cells is
critical for clinicians to prescribe appropriate therapeutic
regimens. For instance, patients with breast cancer who show
no invasion of nearby lymph nodes generally do not relapse
after standard surgical treatment and chemotherapy. For the
25% who do relapse after surgery and chemotherapy, additional
chemotherapy is appropriate. At present, there is no clear
way to determine which patients will benefit from such
additional chemotherapy prior to relapse. However,
correlating p53 mutations to tumorigenicity and metastasis
provides clinicians with a means to determine whether such
additional treatments are warranted.

In addition to facilitating conventional chemotherapy, 
appropriate diagnosis of p53 mutations provides clinicians 
with the ability to identify individuals who will benefit the 
most from gene therapy techniques, in which appropriately 
operative p53 copies are restored to a tumor site. Clinical 
p53 gene therapy trials are presently underway (Culotta & 
Koshland, supra) . 

The analysis of p53 mutations can also be used to 
identify which carcinogens lead to particular tumors (Harris, 
Science 262, 1980-1981 (1993)). For instance, dietary 
aflatoxin exposure is associated with G:C to T:A 
transversions at residue 249 of p53 in hepatocellular 
carcinomas (Hsu et al. , Nature 350, 427 (1991); Bressac et 
al., Nature 350, 429 (1991); Harris, supra). 

While most described p53 mutations are somatic in origin,
some types of cancer are associated with germline p53
mutation. For instance, Li-Fraumeni syndrome is a hereditary
condition in which individuals receive mutant p53 alleles,
resulting in the early onset of various cancers (Harris,
supra; Frebourg et al., PNAS 89, 6413-6417 (1992); Malkin et
al., Science 250, 1233 (1990)). These mutations are
associated with instability in the rest of the genome,
creating multiple genetic alterations, and eventually leading
to cancer.

hMLH1 and hMSH2 are mismatch repair genes which are
causal agents in hereditary nonpolyposis colorectal cancer in
individuals with mutant hMLH1 or hMSH2 alleles (Fishel et al.,
supra, and Papadopoulos et al., supra). Hereditary
nonpolyposis colorectal cancer is a common genetic disorder,
affecting about 1 in 200 individuals (Lynch et al.,
Gastroenterology 104, 1535 (1993)). Detection of hMLH1 and
hMSH2 mutations in the population allows diagnosis of
nonpolyposis colorectal cancer prone individuals prior to the
manifestation of disease. This allows for the implementation
of special screening programs for cancer-prone individuals to
ensure early detection of cancer, thereby enhancing survival
rates of afflicted individuals. In addition, genetic
counselors may use the information derived from hMLH1 and
hMSH2 chips to improve family planning as described for cystic
fibrosis chips. The detection of mutations in hMLH1 and hMSH2
individually or in combination with p53 can also be used by
clinicians to assess cancer prognosis and treatment modality.
Finally, the information can be used to target appropriate
individuals for gene therapy.

The entire hMLH1 gene is less than 85 kbp in length,
comprising 2268 coding nucleotides (Papadopoulos et al.,
supra). Sequences from the gene have been deposited with
GenBank (accession number U07418). Mutations associated with
hereditary nonpolyposis colorectal cancer include the deletion
of exon 5 (codons 578-632), a 4 base pair deletion of codons
727 and 728 resulting in a shift in the reading frame of the
gene, a 4 base pair insertion at codons 755 and 756 resulting
in an extension of the COOH terminus, a 371 base pair deletion
and frameshift mutation at position 347, and a transversion
causing an alteration of codon 252 resulting in the insertion
of a stop codon (id.).

hMSH2 is a human homologue of the bacterial MutS and S.
cerevisiae MSH mismatch-repair genes. MSH2, like hMLH1, is
associated with hereditary nonpolyposis cancer. Although only
a few MSH2 gene samples from tumor tissue have been
characterized, at least some tumor samples show a T to C
transition mutation at position 2020 of the cDNA sequence,
resulting in the loss of an intron-exon splice acceptor site.

In view of the role of mutations in p53, MSH2 and/or
hMLH1 in hereditary predisposition to cancer, to neoplastic
transformation events leading to cancer and to cancer
prognosis, it is important to screen individuals to determine
whether they possess mutant alleles, and to identify precisely
which mutations the individuals possess. Because many
mutations are point mutations, or extremely small insertions
or deletions, which are generally undetectable by standard
Southern analysis, accurate diagnosis requires a capacity to
examine a gene nucleotide-by-nucleotide.

Mutations in the hMSH2, hMLH1 or p53 genes, irrespective
of whether previously characterized, can be detected by any of
the tiling strategies noted above. Reference sequences of
interest include full-length genomic and cDNA sequences of
each of these genes and subsequences thereof, such as exons
and introns. For example, each nucleotide in the 20 kb p53
genomic sequence can be tiled using the basic strategy with an
array of about 80,000 probes. As in the CFTR chip, some
reference sequences are comparatively short sequences
including the site of a known mutation and a few flanking
nucleotides. Some chips tile reference sequences that
encompass mutational "hot spots." For instance, a variety of
cellular and oncoviral proteins bind to specific regions of
p53, including Mdm2, SV40 T antigen, E1b from adenovirus and
E6 from human papilloma virus. These binding sites correlate
to some extent with observed high frequency somatic mutation
regions of p53 found in tumor cells from cancer patients (see
Harris et al., supra). Hot spots include exons 2, 3, 5, 6, 7
and 8 and the intronic regions between exons 2 and 3, 3 and 4
and 4 and 5. Fragments of the hMLH1 gene of particular
interest include those encoding codons 578-632, 727, 728, 347,
252. Some chips are tiled to read mutations in each of the
hMSH2, hMLH1 and p53 genes, both wildtype and mutant versions.
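
The figure of about 80,000 probes follows from the basic tiling
strategy using four probes (one per nucleotide at the
interrogation position) for each nucleotide of the reference
sequence. A brief Python check, in which the function name and
the optional 15-nucleotide probe length are assumptions made for
illustration:

```python
def basic_tiling_probe_count(reference_length, probe_length=None):
    """Number of probes needed to tile a reference with four probes per position.

    If `probe_length` is given, only positions where a full-length probe
    fits inside the reference are counted.
    """
    if probe_length is None:
        positions = reference_length
    else:
        positions = reference_length - probe_length + 1
    return 4 * positions


print(basic_tiling_probe_count(20_000))
print(basic_tiling_probe_count(20_000, probe_length=15))
```

The second figure is slightly smaller because probes cannot
overhang the ends of the reference, which is consistent with the
count being stated as approximate.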

Standard or asymmetric PCR can be used to generate the
target DNA used in the tiling assays described above. In
general, PCR is used to amplify hMSH2, hMLH1 or p53 sequences
from a tissue of interest such as a tumor. Mixed PCR
reactions can also be used to generate hMSH2, hMLH1 or p53
sequences simultaneously in a single reaction mixture. Any of
the coding or noncoding sequences from the genes may be
amplified for use in the block tiling assays described above.

Table 8 below provides examples of primers which are
useful in synthesizing specific regions of hMSH2, hMLH1 and
p53. Other primers can readily be devised from the known
genomic and cDNA sequences of the genes. The primers
described in Table 8 specific for p53 amplification have ends
tailored to facilitate cloning into standard restriction
enzyme cloning sites.

Table 8: Examples of PCR primers useful in amplifying regions of p53, hMLH1 and
hMSH2.

Region Amplified: Exon 5 (p53)
Primer Sequence:  TAA TAC GAC TCA CTA TAG GGA GA CCC
                  TGG GCA ACC AGC CCT GTC GT
Description:      Exon 5 T7 Primer (5' T7 to p53 3').

Region Amplified: Exon 5 (p53)
Primer Sequence:  ATG CAA TTA ACC CTC ACT AAA GGG
                  AGA CAC TTG TGC CCT GAC TTT CAA C
Description:      Exon 5 T3 Primer (5' T3 to p53 3').

Region Amplified: Exon 6 (p53)
Primer Sequence:  TAA TAC GAC TCA CTA TAG GGA GCC
                  TCC TCC CAG AGA CCC
Description:      Exon 6 T7 Primer (5' T7 to p53 3').

Region Amplified: Exon 6 (p53)
Primer Sequence:  ATG CAA TTA ACC CTC ACT A A GGG AGA
                  TCC CCA GGC CTC TGA TTC CTC ACT G
Description:      Exon 6 T3 Primer (5' T3 to p53 3').

Region Amplified: Exon 7 (p53)
Primer Sequence:  TAA TAC GAC TCA CTA TAG GGA CTG
                  GGG CAC AGC CAG GCC AGT GTG CA
Description:      Exon 7 T7 Primer (5' T7 to p53 3').

Region Amplified: Exon 7 (p53)
Primer Sequence:  ATG CAA TTA ACC CTC ACT AAA GGG
                  AGA GTC TCC CCA AGG CGC ACT GGC
                  CTC A
Description:      Exon 7 T3 Primer (5' T3 to p53 3').

Region Amplified: Exon 8 (p53)
Primer Sequence:  TAA TAC GAC TCA CTA TAG GGA GGG
                  CAT AAC TGC ACC CTT GGT CTC CTC C
Description:      Exon 8 T7 Primer (5' T7 to p53 3').

Region Amplified: Exon 8 (p53)
Primer Sequence:  ATG CAA TTA ACC CTC ACT AAA GGG
                  AGA GGA CCT GAT TTC CTT ACT GCC TCT
                  TGC
Description:      Exon 8 T3 Primer (5' T3 to p53 3').

Region Amplified: hMSH2
Primer Sequence:  GAC ATG GCG GTG CAG CCG AAG GAG A
Description:      Primer for MSH2, 5' to 3'. If used with MSH2
                  primer below, a 3033 base pair amplicon will
                  result.

Region Amplified: hMSH2
Primer Sequence:  CTA TGT CAA TTG CAA ACA GTG CTC AGT
                  TAC AG
Description:      Primer for hMSH2, 5' to 3'.

Region Amplified: hMLH1
Primer Sequence:  CTT GGC TCT TCT GGC GCC AAA ATG TCG
                  TTC
Description:      Primer for hMLH1, 5' to 3'. If used with hMLH1
                  primer below, a 2484 base pair amplicon will
                  result.

Region Amplified: hMLH1
Primer Sequence:  TAT GTT AAG ACA CAT CTA TTT ATT TAT
                  A AT CAA TCC
Description:      Primer for hMLH1, 5' to 3'.

After PCR amplification of the target amplicon, one strand
of the amplicon can be isolated, e.g., using a biotinylated
primer that allows capture of the undesired strand on
streptavidin beads. Alternatively, asymmetric PCR can be used
to generate a single-stranded target. Another approach
involves the generation of single-stranded RNA from the PCR
product by incorporating a T7 or other RNA polymerase promoter
in one of the primers. The single-stranded material can
optionally be fragmented to generate smaller nucleic acids
with less significant secondary structure than longer nucleic
acids.

In one such method, fragmentation is combined with
labeling. To illustrate, degenerate 8-mers or other
degenerate short oligonucleotides are hybridized to the
single-stranded target material. In the next step, a DNA
polymerase is added with the four different
dideoxynucleotides, each labeled with a different fluorophore.
Fluorophore-labeled dideoxynucleotides are available from a
variety of commercial suppliers. Hybridized 8-mers are
extended by a labeled dideoxynucleotide. After an optional
purification step, e.g., with a size exclusion column, the
labeled 9-mers are hybridized to the chip. Other methods of
target fragmentation can be employed. The single-stranded DNA
can be fragmented by partial degradation with a DNAse or
partial depurination with acid. Labeling can be accomplished
in a separate step, e.g., fluorophore-labeled nucleotides are
incorporated before the fragmentation step or a DNA binding
fluorophore, such as ethidium homodimer, is attached to the
target after fragmentation.


Exemplary Chips 

a. Exon VI Chip 

To illustrate the value of the DNA chips of the present
invention in such a method, a DNA chip was synthesized by the
VLSIPS™ method to provide an array of overlapping probes which
represent or tile across a 60 base region of exon 6 of the p53
gene. To demonstrate the ability to detect substitution
mutations in the target, twelve different single substitution
mutations (wild type and three different substitutions at each
of three positions) were represented on the chip along with
the wild type. Each of these mutations was represented by a
series of twelve 12-mer oligonucleotide probes, which were
complementary to the wild type target except at the one
substituted base. Each of the twelve probes was complementary
to a different region of the target and contained the mutated
base at a different position (e.g., if the substitution was at
base 32, the set of probes would be complementary, with the
exception of base 32, to regions of the target 21-32, 22-33,
and so on through 32-43). This enabled investigation of the
effect of the substitution position within the probe. The
alignment of some of the probes with a 12-mer model target
nucleic acid is shown in Figure 27.
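
The set of twelve probe regions for a given substituted base can
be enumerated as in the following Python sketch, which uses the
example regions given above for a substitution at base 32. The
function name probe_windows is hypothetical, and positions are
1-based and inclusive.

```python
def probe_windows(substituted_base, probe_length=12):
    """All (start, end) target regions of `probe_length` containing the base."""
    first_start = substituted_base - probe_length + 1
    return [
        (start, start + probe_length - 1)
        for start in range(first_start, substituted_base + 1)
    ]


windows = probe_windows(32)
for start, end in windows:
    # Position of the substituted base within the probe region (1 = first base).
    print(f"{start}-{end}: substituted base at position {32 - start + 1}")
```

The substituted base moves from the last position of the first
region to the first position of the last region, which is what
allows the effect of mismatch position within the probe to be
compared.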

To demonstrate the effect of probe length, an additional
series of ten 10-mer probes was included for each mutation
(see Figure 28). In the vicinity of the substituted
positions, the wild-type sequence was represented by every
possible overlapping 12-mer and 10-mer probe. To simplify
comparisons, the probes corresponding to each varied position
were arranged on the chip in rectangular regions with the
following structure: each row of cells represents one
substitution, with the top row representing the wild type.
Each column contains probes complementary to the same region
of the target, with probes complementary to the 3'-end of the
target on the left and probes complementary to the 5'-end of
the target on the right. The difference between two adjacent
columns is a single base shift in the positioning of the
probes. Whenever possible, the series of 10-mer probes were
placed in four rows immediately underneath and aligned with
the 4 rows of 12-mer probes for the same mutation.

To provide model targets, 5' fluoresceinated 12-mers
containing all possible substitutions in the first position of
codon 192 were synthesized (see the starred position in the
target in Figure 27). Solutions containing 10 nM target DNA
in 6X SSPE, 0.25% Triton X-100 were hybridized to the chip at
room temperature for several hours. While target nucleic acid
was hybridized to the chip, the fluorophores on the chip were
excited by light from an argon laser, and the chip was scanned
with an autofocusing confocal microscope. The emitted signals
were processed by a PC to produce an image using image
analysis software. By 1 to 3 hours, the signal had reached a
plateau; to remove the hybridized target and allow
hybridization to another target, the chip was stripped with
60% formamide, 2X SSPE at 17°C for 5 minutes. The washing
buffer and temperature can vary, but the buffer typically
contains 2-to-3X SSPE, 10-to-60% formamide (one can use
multiple washes, increasing the formamide concentration by 10%
each wash, and scanning between washes to determine when the
wash is complete), and optionally a small percentage of Triton
X-100, and the temperature is typically in the range of
15-to-18°C.

Very distinct patterns were observed after hybridization
with targets with 1 base substitutions and visualization with
a confocal microscope and software analysis, as shown in
Figure 29. In general, the probes which form perfect matches
with the target retain the highest signal. For example, in
the first image, the 12-mer probes that form perfect matches
with the wild-type (WT) target are in the first row (top).
The 12-mer probes with single base mismatches are located in
the second, third, and fourth rows and have much lower
signals. The data are also depicted graphically in Figure 30.
On each graph, the X ordinate is the position of the probe in
its row on the chip, and the Y ordinate is the signal at that
probe site after hybridization. When a target with a
different one base substitution is hybridized, the
complementary set of probes has the highest signal (see
pictures 2, 3, and 4 in Figure 29 and graphs 2, 3, and 4 in
Figure 30). In each case, the probe set with no mismatches
with the target has the highest signals. Within a 12-mer
probe set, the signal was highest at position 6 or 7. The
graphs show that the signal difference between 12-mer probes
at the same X ordinate tended to be greatest at positions 5
and 8 when the target and the complementary probes formed 10
base pairs and 11 base pairs, respectively. Because tumors
often have both WT and mutant p53 genes, mixed target
populations were also hybridized to the chip, as shown in
Figure 31. When the hybridization solution consisted of a 1:1
mixture of WT 12-mer and a 12-mer with a substitution in
position 7 of the target, the sets of probes that were
perfectly matched to both targets showed higher signals than
the other probe sets.

The hybridization efficiency of a 10-mer probe array
relative to a 12-mer probe array was also evaluated. The
10-mer and 12-mer probe arrays gave comparable signals (see
graphs 1-4 in Figure 30 and graphs 1-4 in Figure 32).
However, the 10-mer probe sets, which are in rows 5-8 (see
images in Figure 29), seemed to be better in this model system
than the 12-mer probe sets at resolving one target from
another, consistent with the expectation that one base
mismatches are more destabilizing for 10-mers than 12-mers.
Hybridization results within probe sets perfectly matched to
target also followed the expectation that the more matches
the individual probe formed with the target, the higher the
signal. However, duplexes with two 3' dangles (see Figure 30,
position 6 in graphs 1-4) have about as much signal as the
probes which are matched along their entire length (see Figure
30, position 7, in graphs 1-4).

This illustrative model system shows that 12-mer targets
that differ by one-base substitutions can be readily
distinguished from one another by the novel probe array
provided by the invention and that resolution of the different
12-mer targets was somewhat better with the 10-mer probe sets
than with the 12-mer probe sets.

b. Exon V Chip

To analyze DNA from exon 5 of the p53 tumor suppressor
gene, a set of overlapping 17-mer probes was synthesized on a
chip. The probes for the WT allele were synthesized so as to
tile across the entire exon with single base overlaps between
probes. For each WT probe, a set of 4 additional probes, one
for each possible base substitution at position 7, was
synthesized and placed in a column relative to the WT probe.
Exon 5 DNA was amplified by PCR with primers flanking the
exon. One of the primers was labeled with fluorescein; the
other primer was labeled with biotin. After amplification,
the biotinylated strand was removed by binding to streptavidin
beads. The fluoresceinated strand was used in hybridization.

About 1/3 of the amplified, single-stranded nucleic acid
was hybridized overnight in 5X SSPE at 60°C to the probe chip
(under a cover slip). After washing with 6X SSPE, the chip
was scanned using confocal microscopy. Figure 33 shows an
image of the p53 chip hybridized to the target DNA. Analysis
of the intensity data showed that 93.5% of the 184 bases of
exon 5 were called in agreement with the WT sequence (see
Buchman et al., 1988, Gene 70: 245-252, incorporated herein by
reference). The miscalled bases were from positions where
probe signal intensities were tied (1.6%) and where non-WT
probes had the highest signal intensity (4.9%). Figure 34
illustrates how the actual sequence was read. Gaps in the
sequence of letters in the WT rows correspond to control
probes or sites. Positions at which bases are miscalled are
represented by letters in italic type in cells corresponding
to probes in which the WT bases have been substituted by other
bases.
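
The way a base is read from each column, and the way the calls
are tallied, can be sketched in Python as follows. The function
names and toy intensities are hypothetical. The counts used in
the last step (172, 3 and 9 of 184 bases) are not given in the
text; they are inferred here as the whole-number counts
consistent with the reported percentages.

```python
def call_base(signals):
    """Base with the highest signal in a column, or None when the top is tied."""
    top = max(signals.values())
    best = [base for base, value in signals.items() if value == top]
    return best[0] if len(best) == 1 else None


def tally(calls, reference):
    """Count calls that agree with the reference, are tied, or differ."""
    counts = {"agree": 0, "tie": 0, "other": 0}
    for called, expected in zip(calls, reference):
        if called is None:
            counts["tie"] += 1
        elif called == expected:
            counts["agree"] += 1
        else:
            counts["other"] += 1
    return counts


def percent(count, total):
    return round(100 * count / total, 1)


# Toy columns of probe intensities, one column per interrogated base.
columns = [
    {"A": 90, "C": 12, "G": 10, "T": 11},  # clear call
    {"A": 30, "C": 30, "G": 8, "T": 9},    # tied signals
    {"A": 9, "C": 11, "G": 40, "T": 75},   # non-reference probe brightest
]
calls = [call_base(column) for column in columns]
print(calls, tally(calls, "ACG"))

# Inferred counts (assumption) checked against the reported percentages.
print(percent(172, 184), percent(3, 184), percent(9, 184))
```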

As the diagram indicates, the miscalled bases are from
the low intensity areas of the image, which may be due to
secondary structure in the target or probes preventing
intermolecular hybridization. To diminish the effects due to
secondary structure, one can employ shorter targets (e.g., by
target fragmentation) or use more stringent hybridization
conditions. In addition, the use of a set of probes
synthesized by tiling across the other strand of a duplex
target can also provide sequence information buried in
secondary structure in the other strand. It should be
appreciated, however, that the pattern of low intensity areas
that forms as a result of secondary structure in the target
itself provides a means to identify that a specific target
sequence is present in a sample. Other factors that may
contribute to lower signal intensities include differences in
probe densities and hybridization stabilities.
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These results demonstrate the advantages provided by the 
DNA chips of the invention to genetic analysis. As another 
example, heterozygous mutations are currently sequenced by an 
arduous process involving cloning and repurif ication of DNA. 
5 The cloning step is required, because the gel sequencing 
systems are poor at resolving even a 1:1 mixture of DNA. 
First, the target DNA is amplified by PGR with primers 
allowing easy ligation into a vector, which is taken up by 
transforation of E. coli . which in turn must be cultured, 

10 typically on plates overnight. After growth of the bacteria, 
DNA is purified in a procedure that typically takes about 2 
hours; then, the sequencing reactions are performed, which 
takes at least another hour, and the samples are run on the 
gel for several hours, the duration depending on the length of 

15 the fragment to be sequenced. By contrast, the present 
invention provides direct analysis of the PGR amplified 
material after brief transcription and fragmentation steps, 
saving days of time and labor. 

D. Mitochondrial Genome Chips

A human cell may have several hundred mitochondria, each
with more than one copy of mtDNA. There is strand asymmetry
in the base compositions, with one strand (Heavy) being
relatively G rich, and the other strand (Light) being C rich.
The L strand is 30.9% A, 31.2% C, 13.1% G, and 24.7% T. Human
mtDNA is information-rich, encoding some 22 tRNAs, 12S and 16S
rRNAs, and 13 polypeptides involved in oxidative
phosphorylation. No introns have been detected. RNAs are
processed by cleavage at tRNA sequences, and polyadenylated
posttranscriptionally. In some transcripts, polyadenylation
also creates the stop codon, illustrating the parsimony of
coding. In many individuals, mtDNA can be treated as haploid.
However, some individuals are heteroplasmic (have more than
one mtDNA sequence), and the degree of heteroplasmy can vary
from tissue to tissue. Also, the rate of replication of
mtDNAs can differ and, together with random segregation during
cell division, can lead to changes in heteroplasmy over time.

The human mitochondrial genome is 16,569 nucleotides
long. The sequence of the L-strand is numbered arbitrarily
from the MboI-5/7 boundary in the D-loop region. The complete
sequence of the human mitochondrial genome has been published.
See Anderson et al., Nature 290, 457-465 (1981).
Mitochondrial DNA is maternally inherited, and has a mutation
rate estimated to be tenfold higher than single copy nuclear
DNA (Brown et al., Proc. Natl. Acad. Sci. USA 76, 1967-1971
(1979)). Human mtDNAs differ, on average, by about 70 base
substitutions (Wallace, Ann. Rev. Biochem. 61, 1175-1212
(1992)). Over 80% of substitutions are transitions (i.e.,
pyrimidine-pyrimidine or purine-purine).

Analysis of mitochondrial DNA serves several purposes.
Detection of mutations in the mitochondrial genome allows
diagnosis of a number of diseases. The mitochondrial genome
has been identified as the locus of several mutations
associated with human diseases. Some of the mutations result
in stop codons in structural genes. Such mutations have been
mapped and associated with diseases, such as Leber's
hereditary optic neuropathy, neurogenic muscular weakness,
ataxia and retinitis pigmentosa. Other mutations (nucleotide
substitutions) occur in tRNA coding sequences, and presumably
cause conformational defects in transcribed tRNA molecules.
Such mutations have also been mapped and associated with
diseases such as Myoclonic Epilepsy and Ragged Red Fiber
Disease. Another commonly found type of mutation is deletions
and/or insertions. Some deletions span segments of several
kb. Again, such mutations have been mapped and associated
with diseases, for example, ocular myopathy and Pearson
Syndrome. See Wallace, Ann. Rev. Biochem. 61, 1175-1212 (1992)
(incorporated by reference in its entirety for all purposes).
Early detection of such diseases allows metabolic or genetic
therapy to be administered before irretrievable damage has
occurred. Id. Analysis of mitochondrial DNA is also
important for forensic screening. Because the mitochondrial
genome is a locus of high variability between individuals,
sequencing a substantial length of mitochondrial DNA provides
a fingerprint that is highly specific to an individual.
Analysis of mitochondrial DNA is also important for
evolutionary and epidemiological studies.

The reference sequence can be an entire mitochondrial
genome or any fragment thereof. For forensic and
epidemiological studies, the reference sequence is often all
or part of the D-loop region in which variability between
individuals is greatest (e.g., from 16024-16401 and 29-408).
For detection of mutations, analysis of the entire genome is
useful as a reference sequence, but shorter segments including
the sites of known mutations, and about 1-20 flanking bases
are also useful. Some chips have probes tiling paired
reference sequences, representing wildtype and mutant versions
of a sequence. Tiling a second reference sequence is
particularly useful for detecting an insertion mutation
occurring in 30-50% of ocular myopathy and Pearson syndrome
patients, which consists of direct repeats of the sequence
ACCTCCCTCACCA. Some chips include reference sequences from
more than one mitochondrial genome.

Mitochondrial reference sequences can be tiled using any
of the strategies noted above. The block tiling strategy is
particularly useful for analyzing short reference sequences or
known mutations. Either the block strategy or the basic
strategy is suitable for analyzing long reference sequences.
In many of the tiling strategies, it is possible to use fewer
probes compared with the number used in other chips without
significant loss of sequence information. As noted above,
most point mutations in mitochondrial DNA are transitions, so
for each wildtype nucleotide in a reference sequence, one of
the three possible nucleotide substitutions is much more
likely than the other two. Accordingly, in the basic tiling
strategy, for example, a reference sequence can be tiled using
only two probe sets. One probe set comprises a plurality of
probes, each probe having a segment exactly complementary to
the reference sequence. The second probe set comprises a
corresponding probe for each probe in the first set. However,
a probe from the second probe set differs from the
corresponding probe from the first probe set in an
interrogation position, in which the probe from the second
probe set includes the transition of the nucleotide present in
that position in the probe from the first probe set.
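
The two-probe-set arrangement just described may be sketched as
follows. This is an illustration only and rests on stated
assumptions: the probe length of 15 and the 0-based interrogation
index of 7 are chosen for the example, the function name is
hypothetical, and each probe is written as the target segment it
interrogates rather than as its complement.

```python
# Transition partner of each base (purine <-> purine, pyrimidine <-> pyrimidine).
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def transition_probe_pairs(reference, probe_len=15, interrogation=7):
    """Return (matched, transition) probe pairs tiled across `reference`.

    The first member of each pair matches the reference window exactly;
    the second differs only at the interrogation index, where it carries
    the transition of the reference base. `probe_len` and `interrogation`
    are illustrative assumptions.
    """
    pairs = []
    for start in range(len(reference) - probe_len + 1):
        matched = reference[start:start + probe_len]
        swapped = TRANSITION[matched[interrogation]]
        variant = matched[:interrogation] + swapped + matched[interrogation + 1:]
        pairs.append((matched, variant))
    return pairs
```

Tiling with two probe sets in this way uses half as many probes
as a tiling in which all four bases are represented at each
interrogation position, at the cost of not directly reporting
transversions.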

Target mitochondrial DNA can be amplified, labelled and
fragmented prior to hybridization using the same procedures as
described for other chips. Use of at least two labelled
nucleotides is desirable to achieve uniform labelling. Some
exemplary primers are described below and other primers can be
designed from the known sequence of mitochondrial DNA.
Because mitochondrial DNA is present in multiple copies per
cell, it can also be hybridized directly to a chip without
prior amplification.



Exemplary Chips

The invention provides a DNA chip for analyzing sequences
contained in a 1.3 kb fragment of human mitochondrial DNA from
the "D-loop" region, the most polymorphic region of human
mitochondrial DNA. One such chip comprises a set of 269
overlapping oligonucleotide probes of varying length in the
range of 9-14 nucleotides with varying overlaps arranged in
600 x 600 micron features or synthesis sites in an array 1 cm
x 1 cm in size. The probes on the chip are shown in columnar
form below. An illustrative mitochondrial DNA chip of the
invention comprises the following probes (X, Y coordinates are
shown, followed by the sequence; "DL3" represents the 3'-end
of the probe, which is covalently attached to the chip
surface.)












 X  Y  Probe                 X  Y  Probe
 0  0  DL3 AGTGGGGTATTT      1  1  DL3 GGTTGGTTTGGG
 1  0  DL3 GGGTATTTAGTT      2  1  DL3 TGGGGTTTCTAG
 2  0  DL3 TTAGTTTATCCAA     3  1  DL3 GTTTCTAGTGGG
 3  0  DL3 ATCCAAACCAGG      4  1  DL3 AGTGGGGGGTGT
 4  0  DL3 ACCAGGATCGGA      5  1  DL3 GGGGTGTCAAAT
 5  0  DL3 CGTGTGTGTGTGG     6  1  DL3 GTCAAATACATCG
 6  0  DL3 CGTGTGTGTGTGGC    7  1  DL3 ACATCGAATGGAG
 7  0  DL3 TCGTGTGTGTGTGG    8  1  DL3 CGAATGGAGGAG
 8  0  DL3 GTAGGATGGGTC      9  1  DL3 GAGGAGTTTCGT
 9  0  DL3 AGGATGGGTCGT     10  1  DL3 TTTCGTTATGTGA
10  0  DL3 GATGGGTCGTGT     11  1  DL3 ATGTGACTTTTAC
11  0  DL3 TGGCGACGATTG     12  1  DL3 GACTTTTACAAAT
12  0  DL3 GCGACGATTGGG     13  1  DL3 AAATCTGCCCGA
13  0  DL3 TGGGGGGGA        14  1  DL3 AATCTGCCCGAG
14  0  DL3 GAGGGGGCG        15  1  DL3 CCCGAGTGTAGT
15  0  DL3 GGAGGGGGCGA      16  1  DL3 AGTGTAGTGGGG
16  0  DL3 GAGGGGGCGA        0  2  DL3 GGGAGGGTGAG
 0  1  DL3 GGCTTGGTTGG       1  2  DL3 GGTGAGGGTATG
 2  2  DL3 GGTATGATGATTAG    8  5  DL3 ATTGTTAAACTTA
 3  2  DL3 GATTAGAGTAAGT     9  5  DL3 AAACTTACAGACG
 4  2  DL3 TTAGAGTAAGTTA    10  5  DL3 ACAGACGTGTCG
 5  2  DL3 AAGTTATGTTGGG    11  5  DL3 GTGTCGGTGAAA
 6  2  DL3 GTTGGGGGCG       12  5  DL3 GTGAAAGGTGTGT
 7  2  DL3 GGGGCGGGTA       13  5  DL3 GGTGTGTCTGTAG
 8  2  DL3 GCGGGTAGGAT      14  5  DL3 TGTGTCTGTAGTA
 9  2  DL3 GGTAGGATGGGT     15  5  DL3 GTAGTATTGTTTT
10  2  DL3 GGATGGGTCGTG     16  5  DL3 AGTATTGTTTTTT
11  2  DL3 GGTCGTGTGTGT      0  6  DL3 CCTCGTGGGATA
12  2  DL3 GTGTGTGTGGCG      1  6  DL3 TGGGATACAGCG
13  2  DL3 TGTGGCGACGAT      2  6  DL3 GATACAGCGTCAT
14  2  DL3 GACGATTGGGGT      3  6  DL3 GCGTCATAGACAG
15  2  DL3 ATTGGGGTATGG      4  6  DL3 AGACAGAAACTAA
16  2  DL3 GTATGGGGCTTG      5  6  DL3 CAGAAACTAAGGA
 0  3  DL3 GGATTGTGGTCG      6  6  DL3 TAAGGACGGAGT
 1  3  DL3 TGGTCGGATTGG      7  6  DL3 GACGGAGTAGGA
 2  3  DL3 GGATTGGTCTAAA     8  6  DL3 GTAGGATAATAAA
 3  3  DL3 TCTAAAGTTTAAA     9  6  DL3 TAATAAATAGCG
 4  3  DL3 GTTTAAAATAGAA    10  6  DL3 ATAGCGTAGGAT
 5  3  DL3 ATAGAAAAACCG     11  6  DL3 TAGCGTAGGATG
 6  3  DL3 AGAAAAACCGC      12  6  DL3 AGGATGCAAGTT
 7  3  DL3 AACCGCCATAC      13  6  DL3 ATGCAAGTTATAA
 8  3  DL3 CCATACGTGAAAA    14  6  DL3 GTTATAATGTCCG
 9  3  DL3 ACGTGAAAATTGT    15  6  DL3 ATGTCCGCTTGT
10  3  DL3 AATTGTCAGTGGG    16  6  DL3 TCCGCTTGTATG
11  3  DL3 TGTCAGTGGGGG      0  7  DL3 GTGAGTGCCCTC
12  3  DL3 TGGGGGGTTGA       1  7  DL3 TGCCCTCGAGAG
13  3  DL3 GGGTTGATTGTGT     2  7  DL3 CCTCGAGAGGTA
14  3  DL3 TTGTGTAATAAAA     3  7  DL3 AGAGGTACGTAA
15  3  DL3 AATAAAAGGGGA      4  7  DL3 ACGTAAACCATA
16  3  DL3 TAAAAGGGGAGG      5  7  DL3 ACCATAAAAGCAG
 0  4  DL3 GTTTTTTAAAGG      6  7  DL3 AAAGCAGACCC
 1  4  DL3 TTTTAAAGGTGG      7  7  DL3 AGACCCCCCAT
 2  4  DL3 AGGTGGTTTGG       8  7  DL3 CCCCCATACGT
 3  4  DL3 TTGGGGGGGAG       9  7  DL3 CATACGTGCGCT
 4  4  DL3 GGAGGGGGCG       10  7  DL3 GTGCGCTATCAG
 5  4  DL3 GGGGCGAAGAC      11  7  DL3 GCGCTATCAGTA
 6  4  DL3 GAAGACCGGATG     12  7  DL3 TCAGTAACGCTC
 7  4  DL3 CCGGATGTCGTG     13  7  DL3 GTAACGCTCTGC
 8  4  DL3 GTCGTGAATTTGT    14  7  DL3 CTCTGCGACCTC
 9  4  DL3 CGTGAATTTGTGT    15  7  DL3 GACCTCGGCCT
10  4  DL3 TTGTGTAGAGACG    16  7  DL3 TCGGCCTCGTG
11  4  DL3 TAGAGACGGTTT      0  8  DL3 GATGAAGTCCCAG
12  4  DL3 ACGGTTTGGGG       1  8  DL3 AGTCCCAGTATTT
13  4  DL3 TGGGGTTTTTGT      2  8  DL3 GTATTTCGGATTT
14  4  DL3 GGGTTTTTGTTT      3  8  DL3 TCGGATTTATCG
15  4  DL3 TTGTTTCTTGGG      4  8  DL3 GATTTATCGGGT
16  4  DL3 TCTTGGGATTGTG     5  8  DL3 ATCGGGTGTGCA
 0  5  DL3 TGTATGAATGATTT    6  8  DL3 TGTGCAAGGGGA
 1  5  DL3 TGATTTCACACAA     7  8  DL3 CAAGGGGAATTT
 2  5  DL3 ACACAATTAATTAA    8  8  DL3 GAATTTATTCTGTA
 3  5  DL3 AATTAATTACGAA     9  8  DL3 TCTGTAGTGCTAC
 4  5  DL3 TACGAACATCCTG    10  8  DL3 GTAGTGCTACCT
 5  5  DL3 ACGAACATCCTGT    11  8  DL3 GCTACCTAGTAG
 6  5  DL3 TCCTGTATTATTA    12  8  DL3 CTAGTAGTCCAGA
 7  5  DL3 GTATTATTATTGTT   13  8  DL3 TCCAGATAGTGGG
14  8  DL3 AGATAGTGGGATA     8 12  DL3 TGTTCGTTCATGT
15  8  DL3 GGGATAATTGGT      9 12  DL3 CGTTCATGTCGTT
16  8  DL3 TAATTGGTGAGTG    10 12  DL3 GTCGTTAGTTGG
 0  9  DL3 TATAGGGCGTGT     11 12  DL3 TAGTTGGGAGTT
 1  9  DL3 GGCGTGTTCTCA     12 12  DL3 GGAGTTGATAGTG
 2  9  DL3 GTGTTCTCACGAT    13 12  DL3 ATAGTGTGTAGTT
 3  9  DL3 TCACGATGAGAGG    14 12  DL3 GTGTAGTTGACGT
 4  9  DL3 ATGAGAGGAGCG     15 12  DL3 TGACGTTGAGGT
 5  9  DL3 AGGAGCGAGGC      16 12  DL3 CGTTGAGGTTTA
 6  9  DL3 CGAGGCCCGG        5 13  DL3 TATAACATGCCAT
 7  9  DL3 GCCCGGGTATT       6 13  DL3 AACATGCCATGGT
 8  9  DL3 CGGGTATTGTGA      7 13  DL3 CCATGGTATTTAT
 9  9  DL3 GTGAACCCCCAT      8 13  DL3 ATTTATGAACTGG
10  9  DL3 CCCCATCGATTT      9 13  DL3 AACTGGTGGACAT
11  9  DL3 ATCGATTTCACTT    10 13  DL3 TGGACATCATGTA
12  9  DL3 TTTCACTTGACAT    11 13  DL3 CATGTATTTTTGG
13  9  DL3 TTGACATAGAGCT    12 13  DL3 TTTTGGGTTAGG
14  9  DL3 TAGAGCTGTAGAC    13 13  DL3 GGGTTAGGATGT
15  9  DL3 GTAGACCAAGGA     14 13  DL3 GGATGTAGTTTTG
16  9  DL3 ACCAAGGATGAAG    15 13  DL3 TGTAGTTTTGGG
 0 10  DL3 CGTGTAATGTCAG    16 13  DL3 TTTGGGGGAGG
 1 10  DL3 TGTCAGTTTAGGG     5 14  DL3 GGGTTCATAACTG
 2 10  DL3 TCAGTTTAGGGA      6 14  DL3 ATAACTGAGTGGG
 3 10  DL3 TAGGGAAGAGCA      7 14  DL3 AACTGAGTGGGT
 4 10  DL3 AAGAGCAGGGGT      8 14  DL3 GTGGGTAGTTGT
 5 10  DL3 CAGGGGTACCTA      9 14  DL3 GTAGTTGTTGGC
 6 10  DL3 GGTACCTACTGG     10 14  DL3 GTTGGCGATACA
 7 10  DL3 TACTGGGGGGA      11 14  DL3 CGATACATAAAAG
 8 10  DL3 GGGGGAGTCTAT     12 14  DL3 TAAAAGCATGTAA
 9 10  DL3 AGTCTATCCCCA     13 14  DL3 GCATGTAATGACG
10 10  DL3 ATCCCCAGGGA      14 14  DL3 ATGACGGTCGGT
11 10  DL3 CAGGGAACTGGT     15 14  DL3 GTCGGTGGTACT
12 10  DL3 ACTGGTGGTAGG     16 14  DL3 GGTACTTATAACA
13 10  DL3 CTGGTGGTAGGA      5 15  DL3 TCGATTCTAAGAT
14 10  DL3 GTAGGAGGCACA      6 15  DL3 TAAGATTAAATTT
15 10  DL3 GGCACATTTAGT      7 15  DL3 AAATTTGAATAAG
16 10  DL3 TTTAGTTATAGGG     8 15  DL3 AATAAGAGACAAG
 0 11  DL3 AGGTTTACGGTG      9 15  DL3 AAGAGACAAGAAA
 1 11  DL3 TACGGTGGGGA      10 15  DL3 AAGAAAGTACCC
 2 11  DL3 GTGGGGAGTGG      11 15  DL3 AAAGTACCCCTT
 3 11  DL3 GGGAGTGGGTGA     12 15  DL3 CCCCTTCGTCTA
 4 11  DL3 GGGTGATCCTATG    13 15  DL3 CTTCGTCTAAAC
 5 11  DL3 CCTATGGTTGTTT    14 15  DL3 CTAAACCCATGG
 6 11  DL3 GGTTGTTTGGATG    15 15  DL3 AACCCATGGTGG
 7 11  DL3 GTTTGGATGGGT     16 15  DL3 TGGTGGGTTCAT
 8 11  DL3 ATGGGTGGGAAT      5 16  DL3 TTGGAAAAAGGT
 9 11  DL3 GGGAATTGTCATG     6 16  DL3 AAAAGGTTCCTG
10 11  DL3 GTCATGTATCATGT    7 16  DL3 GGTTCCTGTTTA
11 11  DL3 TCATGTATTTCGG     8 16  DL3 CCTGTTTAGTCTC
12 11  DL3 TATTTCGGTAAA      9 16  DL3 TTAGTCTCTTTTT
13 11  DL3 TTCGGTAAATGG     10 16  DL3 CTTTTTCAGAAAT
14 11  DL3 GTAAATGGCATGT    11 16  DL3 AGAAATTGAGGTG
15 11  DL3 GCATGTAATCGTG    12 16  DL3 AAATTGAGGTGGT
16 11  DL3 GTAATCGTGTAAT    13 16  DL3 GGTGGTAATCGT
 5 12  DL3 GGGAGGGGTAC      14 16  DL3 TAATCGTGGGTT
 6 12  DL3 GGGTACGAATGT     15 16  DL3 GTGGGTTTCGAT
 7 12  DL3 ACGAATGTTCGTT    16 16  DL3 GGTTTCGATTCT

No probes were present in positions X, Y = 0, 12 to X, Y = 4,
12; X, Y = 0, 13 to X, Y = 4, 13; X, Y = 0, 14 to X, Y = 4,
14; X, Y = 0, 15 to X, Y = 4, 15; X, Y = 0, 16 to X, Y = 4,
16.

The length of each of the probes on the chip was variable to
minimize differences in melting temperature and potential for
cross-hybridization. Each position in the sequence was
represented by at least one probe and most positions were
represented by 2 or more probes. As noted above, the amount
of overlap between the oligonucleotides varied from probe to
probe. Figure 35 shows the human mitochondrial genome; "O_H"
is the H strand origin of replication, and arrows indicate the
cloned unshaded sequence.

DNA was prepared from hair roots of six human donors (mt1
to mt6) and then amplified by PCR and cloned into M13; the
resulting clones were sequenced using chain terminators to
verify that the desired specific sequences were present. DNA
from the sequenced M13 clones was amplified by PCR,
transcribed in vitro, and labeled with fluorescein-UTP using
T3 RNA polymerase. The 1.3 kb RNA transcripts were fragmented
and hybridized to the chip. The results showed that each
different individual had DNA that produced a unique
hybridization fingerprint on the chip and that the differences
in the observed patterns could be correlated with differences
in the cloned genomic DNA sequence. The results also
demonstrated that very long sequences of a target nucleic acid
can be represented comprehensively as a specific set of
overlapping oligonucleotides and that arrays of such probe
sets can be usefully applied to genetic analysis.

The sample nucleic acid was hybridized to the chip in a
solution composed of 6 X SSPE, 0.1% Triton X-100 for 60
minutes at 15°C. The chip was then scanned by confocal
scanning fluorescence microscopy. The individual features on
the chip were 588 x 588 microns, but the lower left 5 x 5
square features in the array did not contain probes. To
quantitate the data, pixel counts were measured within each
synthesis site. Pixels represent 50 x 50 microns. The
fluorescence intensity for each feature was scaled to a mean
determined from 27 bright features. After scanning, the chip
was stripped and rehybridized; all six samples were hybridized
to the same chip. Figure 36 shows the image observed from the
mt4 sample on the DNA chip. Figure 37 shows the image
observed from the mt5 sample on the DNA chip. Figure 38 shows
the predicted difference image between the mt4 and mt5 samples
on the DNA chip based on mismatches between the two samples
and the reference sequence (see Anderson et al., supra).
Figure 39 shows the actual difference image observed.

The results show that, in almost all cases, mismatched
probe/target hybrids resulted in lower fluorescence intensity
than perfectly matched hybrids. Nonetheless, some probes
detected mutations (or specific sequences) better than others,
and in several cases, the differences were within noise
levels. Improvements can be realized by increasing the amount
of overlap between probes and hence overall probe density and,
for duplex DNA targets, using a second set of probes, either
on the same or a separate chip, corresponding to the second
strand of the target. Figure 40, in sheets 1 and 2, shows a
plot of normalized intensities across rows 10 and 11 of the
array and a tabulation of the mutations detected.

Figure 41 shows the discrimination between wild-type and
mutant hybrids obtained with this chip. The median of the six
normalized hybridization scores for each probe was taken. The
graph plots the ratio of the median score to the normalized
hybridization score versus mean counts. On this graph, a
ratio of 1.6 and mean counts above 50 yield no false
positives, and while it is clear that detection of some
mutants can be improved, excellent discrimination is achieved,
considering the small size of the array. Figure 42
illustrates how the identity of the base mismatch may
influence the ability to discriminate mutant and wild-type
sequences more than the position of the mismatch within an
oligonucleotide probe. The mismatch position is expressed as
% of probe length from the 3'-end. The base change is
indicated on the graph. These results show that the DNA chip
increases the capacity of the standard reverse dot blot format
by orders of magnitude, extending the power of that approach
many fold and that the methods of the invention are more
efficient and easier to automate than gel-based methods of
nucleic acid sequence and mutation analysis.
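
The discrimination measure plotted in Figure 41 (the ratio of the
median of the six normalized scores for a probe to the score of
an individual sample) may be sketched as follows. The sketch is
illustrative only. The function name, the use of a single mean
count value per probe, and the reading that a ratio at or above
the cutoff marks a sample as mismatched at that probe are
assumptions made for the example.

```python
from statistics import median


def flag_mismatches(scores, mean_counts, ratio_cutoff=1.6, count_cutoff=50.0):
    """Flag samples whose score falls well below the median for one probe.

    `scores` holds the normalized hybridization scores of a single probe
    across the samples; `mean_counts` is that probe's mean signal. A sample
    is flagged only when the probe is bright enough (mean counts above
    `count_cutoff`) and median/score reaches `ratio_cutoff`.
    """
    med = median(scores)
    flags = []
    for score in scores:
        ratio = med / score if score > 0 else float("inf")
        flags.append(mean_counts > count_cutoff and ratio >= ratio_cutoff)
    return flags
```

Because the median of six samples is used as the reference, the
sketch presumes that most samples match the probe; a probe that
is mismatched in most samples would need a different reference
value.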

To illustrate further these advantages, a second chip was
prepared for analyzing a longer segment from human
mitochondrial DNA (mtDNA). The chip "tiles" through 648
nucleotides of a reference sequence comprising human H strand
mtDNA from positions 16280 to 356, and allows analysis of each
nucleotide in the reference sequence. The probes in the array
are 15 nucleotides in length, and each position in the target
sequence is represented by a set of 4 probes (A, C, G, T
substitutions), which differed from one another at position 7
from the 3'-end. The array consists of 13 blocks of 4 x 50
probes: each block scans through 50 nucleotides of contiguous
mtDNA sequence. The blocks are separated by blank rows. The
4 corner columns contain control probes; there are a total of
2600 probes in a 1.28 cm x 1.28 cm square area (feature), and
each area is 256 x 197 microns.
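
The layout of this array (a column of four 15-mer probes per
reference position, differing only at position 7 from the
3'-end) may be sketched as follows. The sketch is illustrative
only. The function names are hypothetical, each probe is written
5' to 3' as the target segment it interrogates rather than as its
complement, and the conversion of "position 7 from the 3'-end"
to a 0-based index of `probe_len - 7` is an assumption about how
the position is counted.

```python
BASES = "ACGT"


def four_probe_column(window, index):
    """Return the four probes for one window, keyed by the base at `index`."""
    return {base: window[:index] + base + window[index + 1:] for base in BASES}


def tile_reference(reference, probe_len=15, from_3prime=7):
    """Tile `reference` with one four-probe column per full-length window."""
    index = probe_len - from_3prime  # 0-based index in a 5'->3' string (assumed)
    return [
        four_probe_column(reference[start:start + probe_len], index)
        for start in range(len(reference) - probe_len + 1)
    ]
```

In each column, exactly one of the four probes matches the
reference window. With 13 blocks of 4 x 50 probes, the probe
count is 13 x 4 x 50 = 2600, in agreement with the total stated
above.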

Target RNA was prepared as above. The RNA was fragmented
and hybridized to the oligonucleotide array in a solution
composed of 6X SSPE, 0.1% Triton X-100 for 60 minutes at 18°C.
Unhybridized material was washed away with buffer, and the
chip was scanned at 25 micron pixel resolution.

Figure 43 provides a 5' to 3' sequence listing of one
target corresponding to the probes on the chip. X is a
control probe. Positions that differ in the target (i.e., are
mismatched with the probe at the designated site) are in bold.
Figure 44 shows the fluorescence image produced by scanning
the chip when hybridized to this sample. About 95% of the
sequence could be read correctly from only one strand of the
original duplex target nucleic acid. Although some probes did
not provide excellent discrimination and some probes did not
appear to hybridize to the target efficiently, excellent
results were achieved. The target sequence differed from the
probe set at six positions: 4 transitions and 2 insertions.
All 4 transitions were detected, and specific probes could
readily be incorporated into the array to detect insertions or
deletions. Figure 45 illustrates the detection of 4
transitions in the target sequence relative to the wild-type
probes on the chip.

A further chip was constructed comprising probes tiling
across the entire D-loop region (1.3 kb) of mtDNA sequences
from two humans. The probes were tiled in rows of four using
the basic tiling strategy. The probes were overlapping 15
mers having an interrogation position 7 nucleotides from the
3' end. The complete group of probes tiled on the reference
sequence from the first individual, designated mt1, occupied
the upper half of the chip. The lower half of the chip
contained a similar arrangement based on a second clone, mt2.
The probes were synthesized in a 1.28 x 1.28 cm area, which
contained a matrix of 115 x 120 cells. The chip contained a
total of 10,488 mtDNA probes.

Six samples of target DNA were extracted from hair roots
of six individuals. The 1.3 kb region spanning positions
15935 to 667 of human mtDNA was PCR amplified, cloned in
bacteriophage M13 and sequenced by conventional methods. The
1.3 kb region was reamplified from the phage clone using
primers L15935-T3,
5'-CTCGGAATTAACCCTCACTAAAGGAAACCTTTTTCCAAGGA, and H667-T7,
5'-TAATACGACTCACTATAGGGAGAGGCTAGGACCAAACCTATT, tagged with T3
and T7 RNA polymerase promoter sequences. Labelled RNA was
generated by in vitro transcription using T3 RNA polymerase
and fluoresceinated nucleotides, fragmented, and hybridized to
the mtDNA control region resequencing chip at room temperature
for 60 min, in 6xSSPE + 0.05% Triton X-100. Six washes were
carried out at room temperature, using 6xSSPE + 0.005% Triton
X-100, and the chip was read. Signal intensities varied
considerably over the chip, but the large dynamic range of the
detection system allowed accurate quantitation of intensities
over several orders of magnitude. Even relatively low signal
intensities yielded accurate results.

Five different clones (mt1-5) were hybridized, each to a
separate chip. The reference sequence was also hybridized for
comparative purposes. Mean counts per probe cell were
determined, and used by automated basecalling software to read
the sequence. The accuracy of sequence read from the chip is
summarized as follows. Combining the data from the five
targets analyzed, the chip read a total of 6310 nucleotides.
Of these nucleotides in the target sequences, 55 were
different from the reference sequence (as judged by
conventional sequencing). 41 of these 55 nucleotides were both
detected and read correctly from the chip. 6 of 55
nucleotides were detected as being ambiguous but their
identity could not be read. 2 of 55 nucleotides were detected
as mutations, but their identity was miscalled. 6 of 55
nucleotides were incorrectly called as wildtype. Of the 6255
nucleotides in the target sequence that were identical to the
reference sequence, only 36 (0.57%) were miscalled or scored
as ambiguous.
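
The figures in this summary can be cross-checked with a few lines
of arithmetic. The following sketch only restates the numbers
given above; the variable and function names are illustrative.

```python
# Counts taken from the accuracy summary above.
total_read = 6310
variant_sites = 55
outcomes = {
    "detected and read correctly": 41,
    "detected but ambiguous": 6,
    "detected but miscalled": 2,
    "called as wildtype": 6,
}
identical_sites = total_read - variant_sites
errors_at_identical_sites = 36


def percent(part, whole):
    """Express `part` as a percentage of `whole`."""
    return 100.0 * part / whole


# Variant positions that were at least detected, whether or not read correctly.
detected = variant_sites - outcomes["called as wildtype"]

print(sum(outcomes.values()), identical_sites)
print(round(percent(outcomes["detected and read correctly"], variant_sites), 1))
print(round(percent(detected, variant_sites), 1))
print(round(percent(errors_at_identical_sites, identical_sites), 3))
```

The four outcome categories account for all 55 variant positions,
the remaining 6255 positions match the reference, and 36 of 6255
corresponds to an error rate between 0.57% and 0.58%.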

A further chip was constructed comprising probes tiling
across a reference sequence comprising an entire mitochondrial
genome. In this chip, a block tiling strategy was used. Each
block was designed to analyze seven nucleotides from a target
sequence. Each block consisted of four probe sets, the probe
sets each having seven probes. A block was laid down on the
chip in seven columns of four probes. The upper probe was the
same in each column, this being a probe exactly complementary
to a subsequence of the reference sequence. The three other
probes in each column were identical to the upper probe except
in an interrogation position, which was occupied by a
different base in each of the four probes in the column. The
interrogation position shifted by one position between
successive columns. Thus, except for the seven interrogation
positions, one in each of the columns of probes, all probes
occupying a block were identical. The array comprised many
such blocks, each tiled to successive subsequences of the
mitochondrial DNA reference sequence. In all, the chip tiled
16,569 nucleotides of reference sequence with double tiling at
42 positions. 66,276 probes occupied an array of 304 x 315
cells, each cell having an area of 42 x 41 microns.

The chip was hybridized to the same target sequences as
described for the D-loop region, except that hybridization was
at 15°C for 2 hr. The chip was scanned at 5 micron resolution
to give an image with approximately 64 pixels per cell. For
blocks of probes tiling across the D-loop region, a
sequence-specific hybridization pattern was obtained. For other
blocks, only background hybridization was observed.

These results illustrate that longer sequences can be
read using the DNA chips and methods of the invention, as
compared to conventional sequencing methods, where reading
length is limited by the resolution of gel electrophoresis.
Hybridization and signal detection require less than an hour
and can be readily shortened by appropriate choice of buffers,
temperatures, probes, and reagents.

III. MODES OF PRACTICING THE INVENTION
VLSIPS™ Technology
As noted above, the VLSIPS™ technology is described in a
number of patent publications and is preferred for making the
oligonucleotide arrays of the invention. A brief description
of how this technology can be used to make and screen DNA
chips is provided in this Example and the accompanying
Figures. In the VLSIPS™ method, light is shone through a mask
to activate functional (for oligonucleotides, typically an
-OH) groups protected with a photoremovable protecting group
on a surface of a solid support. After light activation, a
nucleoside building block, itself protected with a
photoremovable protecting group (at the 5'-OH), is coupled to
the activated areas of the support. The process can be
repeated, using different masks or mask orientations and
building blocks, to prepare very dense arrays of many
different oligonucleotide probes. The process is illustrated
in Figure 46; Figure 47 illustrates how the process can be
used to prepare "nucleoside combinatorials" or
oligonucleotides synthesized by coupling all four nucleosides
to form dimers, trimers and so forth.

New methods for the combinatorial chemical synthesis of
peptide, polycarbamate, and oligonucleotide arrays have
recently been reported (see Fodor et al., 1991, Science 251:
767-773; Cho et al., 1993, Science 261: 1303-1305; and
Southern et al., 1992, Genomics 13: 1008-1017, each of which
is incorporated herein by reference). These arrays, or
biological chips (see Fodor et al., 1993, Nature 364: 555-556,
incorporated herein by reference), harbor specific chemical
compounds at precise locations in a high-density, information
rich format, and are a powerful tool for the study of
biological recognition processes. A particularly exciting
application of the array technology is in the field of DNA
sequence analysis. The hybridization pattern of a DNA target
to an array of shorter oligonucleotide probes is used to gain
primary structure information of the DNA target. This format
has important applications in sequencing by hybridization, DNA
diagnostics and in elucidating the thermodynamic parameters
affecting nucleic acid recognition.

Conventional DNA sequencing technology is a laborious
procedure requiring electrophoretic size separation of labeled
DNA fragments. An alternative approach, termed Sequencing By
Hybridization (SBH), has been proposed (Lysov et al., 1988,
Dokl. Akad. Nauk SSSR 303:1508-1511; Bains et al., 1988, J.
Theor. Biol. 135:303-307; and Drmanac et al., 1989, Genomics
4:114-128, incorporated herein by reference). This method
uses a set of short oligonucleotide probes of defined sequence
to search for complementary sequences on a longer target
strand of DNA. The hybridization pattern is used to
reconstruct the target DNA sequence. It is envisioned that
hybridization analysis of large numbers of probes can be used
to sequence long stretches of DNA. In immediate applications
of this hybridization methodology, a small number of probes
can be used to interrogate local DNA sequence.

The strategy of SBH can be illustrated by the following
example. A 12-mer target DNA sequence, AGCCTAGCTGAA, is mixed
with a complete set of octanucleotide probes. If only perfect
complementarity is considered, five of the 65,536 octamer
probes (TCGGATCG, CGGATCGA, GGATCGAC, GATCGACT, and ATCGACTT)
will hybridize to the target. Alignment of the overlapping
sequences from the hybridizing probes reconstructs the
complement of the original 12-mer target:

    TCGGATCG
     CGGATCGA
      GGATCGAC
       GATCGACT
        ATCGACTT
    TCGGATCGACTT
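
The alignment above can be reproduced with a short sketch. It is only an illustration: the function names are hypothetical, the complement is written left to right as in the text (not reversed), and the reconstruction assumes that every 7-base overlap between probes is unique, which holds for this 12-mer but not for targets with repeats.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq):
    """Base-by-base complement, written left to right as in the text."""
    return "".join(COMPLEMENT[base] for base in seq)


def hybridizing_probes(target, probe_len=8):
    """Probes of length probe_len that are perfectly complementary to target."""
    comp = complement(target)
    return {comp[i:i + probe_len] for i in range(len(comp) - probe_len + 1)}


def reconstruct(probes):
    """Chain probes through (k-1)-base overlaps.

    Assumes every (k-1)-base overlap is unique (no repeats in the target).
    """
    probes = set(probes)
    by_prefix = {p[:-1]: p for p in probes}
    suffixes = {p[1:] for p in probes}
    # The first probe is the one whose prefix is not the suffix of any probe.
    (start,) = [p for p in probes if p[:-1] not in suffixes]
    sequence = start
    remaining = probes - {start}
    while remaining:
        nxt = by_prefix.get(sequence[-(len(start) - 1):])
        if nxt is None or nxt not in remaining:
            break
        sequence += nxt[-1]  # each new probe contributes one new base
        remaining.remove(nxt)
    return sequence


probes = hybridizing_probes("AGCCTAGCTGAA")
print(sorted(probes))
print(reconstruct(probes))
```

Each hybridizing probe overlaps its neighbor by seven bases, so every probe after the first extends the reconstructed sequence by a single base.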

Hybridization methodology can be carried out by attaching
target DNA to a surface. The target is interrogated with a
set of oligonucleotide probes, one at a time (see Strezoska et
al., 1991, Proc. Natl. Acad. Sci. USA 88:10089-10093, and
Drmanac et al., 1993, Science 260:1649-1652, each of which is
incorporated herein by reference). This approach can be
implemented with well established methods of immobilization
and hybridization detection, but involves a large number of
manipulations. For example, to probe a sequence utilizing a
full set of octanucleotides, tens of thousands of
hybridization reactions must be performed. Alternatively, SBH
can be carried out by attaching probes to a surface in an
array format where the identity of the probes at each site is
known. The target DNA is then added to the array of probes.
The hybridization pattern determined in a single experiment
directly reveals the identity of all complementary probes.

As noted above, a preferred method of oligonucleotide
probe array synthesis involves the use of light to direct the
synthesis of oligonucleotide probes in high-density,
miniaturized arrays. Photolabile 5'-protected
N-acyl-deoxynucleoside phosphoramidites, surface linker
chemistry, and versatile combinatorial synthesis strategies
have been developed for this technology. Matrices of
spatially-defined oligonucleotide probes have been generated,
and the ability to use these arrays to identify complementary
sequences has been demonstrated by hybridizing fluorescent
labeled oligonucleotides to the DNA chips produced by the
methods. The hybridization pattern demonstrates a high degree
of base specificity and reveals the sequence of
oligonucleotide targets.

The basic strategy for light-directed oligonucleotide
synthesis (1) is outlined in Fig. 46. The surface of a solid
support modified with photolabile protecting groups (X) is
illuminated through a photolithographic mask, yielding
reactive hydroxyl groups in the illuminated regions. A
3'-O-phosphoramidite activated deoxynucleoside (protected at
the 5'-hydroxyl with a photolabile group) is then presented to
the surface and coupling occurs at sites that were exposed to
light. Following capping and oxidation, the substrate is
rinsed and the surface illuminated through a second mask, to
expose additional hydroxyl groups for coupling. A second
5'-protected, 3'-O-phosphoramidite activated deoxynucleoside
is presented to the surface. The selective photodeprotection
and coupling cycles are repeated until the desired set of
products is obtained.

Light directed chemical synthesis lends itself to highly
efficient synthesis strategies which will generate a maximum
number of compounds in a minimum number of chemical steps.
For example, the complete set of 4^n polynucleotides (length
n), or any subset of this set, can be produced in only 4 x n
chemical steps. See Fig. 47. The patterns of illumination
and the order of chemical reactants ultimately define the
products and their locations. Because photolithography is
used, the process can be miniaturized to generate high-density
arrays of oligonucleotide probes. For an example of the
nomenclature useful for describing such arrays, an array
containing all possible octanucleotides of dA and dT is
written as (A+T)^8. Expansion of this polynomial reveals the
identity of all 256 octanucleotide probes from AAAAAAAA to
TTTTTTTT. A DNA array composed of complete sets of
dinucleotides is referred to as having a complexity of 2. The
array given by (A+T+C+G)^8 is the full 65,536 octanucleotide
array of complexity four. Computer-aided methods of laying
down predesigned arrays of probes using VLSIPS™ technology are
described in commonly-assigned co-pending application USSN
08/249,188, filed May 24, 1994 (incorporated by reference in
its entirety for all purposes).
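
The step count can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the mask layout of Fig. 47: the function name is hypothetical, and the "mask" for each coupling step is modeled simply as the set of site indices whose base-b digit at the current position selects the base being coupled.

```python
def light_directed_synthesis(n, bases="ATCG"):
    """Simulate combinatorial synthesis of all len(bases)**n oligomers.

    Each round adds one position; within a round there is one masked
    coupling step per base, so the total is len(bases) * n steps.
    """
    b = len(bases)
    sites = [""] * (b ** n)
    steps = 0
    for position in range(n):
        for digit, base in enumerate(bases):
            # "Illuminate" only the sites that should receive this base.
            for site in range(len(sites)):
                if (site // b ** position) % b == digit:
                    sites[site] += base
            steps += 1
    return sites, steps


# (A+T)^8: all octanucleotides of dA and dT.
at_sites, at_steps = light_directed_synthesis(8, bases="AT")
print(len(set(at_sites)), at_steps)

# (A+T+C+G)^8: the full octanucleotide array.
full_sites, full_steps = light_directed_synthesis(8)
print(len(set(full_sites)), full_steps)
```

The number of distinct products grows as b^n while the number of coupling steps grows only as b x n, which is the relationship the paragraph above describes.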

To carry out hybridization of DNA targets to the probe
arrays, the arrays are mounted in a thermostatically
controlled hybridization chamber. Fluorescein labeled DNA
targets are injected into the chamber and hybridization is
allowed to proceed for 5 min to 24 hr. The surface of the
matrix is scanned in an epifluorescence microscope (Zeiss
Axioscop 20) equipped with photon counting electronics using
50-100 µW of 488 nm excitation from an Argon ion laser
(Spectra Physics Model 2020). Measurements may be made with
the target solution in contact with the probe matrix or after
washing. Photon counts are stored and image files are
presented after conversion to an eight bit image format. See
Fig. 51.

When hybridizing a DNA target to an oligonucleotide
array, N = Lt - (Lp - 1) complementary hybrids are expected, where
N is the number of hybrids, Lt is the length of the DNA
target, and Lp is the length of the oligonucleotide probes on
the array. For example, for an 11-mer target hybridized to an
octanucleotide array, N = 4. Hybridizations with mismatches
at positions that are 2 to 3 residues from either end of the
probes will generate detectable signals. Modifying the above
expression for N, one arrives at a relationship estimating the
number of detectable hybridizations (Nd) for a DNA target of
length Lt and an array of complexity C. Assuming an average
of 5 positions giving signals above background:
    Nd = (1 + 5(C - 1)) [Lt - (Lp - 1)].
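
The two expressions can be evaluated directly. The sketch below uses hypothetical function names; the default of five mismatch positions is the average assumed in the text, and the factor (C - 1) is read here as the number of alternative bases available at each such position.

```python
def expected_hybrids(target_len, probe_len):
    """N = Lt - (Lp - 1): perfectly complementary probes for one target."""
    return target_len - (probe_len - 1)


def detectable_hybrids(target_len, probe_len, complexity, mismatch_positions=5):
    """Nd = (1 + 5(C - 1)) [Lt - (Lp - 1)], with 5 as the assumed average."""
    per_match = 1 + mismatch_positions * (complexity - 1)
    return per_match * expected_hybrids(target_len, probe_len)


print(expected_hybrids(11, 8))       # 11-mer target, octanucleotide array
print(detectable_hybrids(11, 8, 4))  # same target, array of complexity four
```

For the 11-mer example, N is 4 and, on an array of complexity four, the estimate for Nd is 16 signals per perfect match times 4 matches.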

Arrays of oligonucleotides can be efficiently generated
by light-directed synthesis and can be used to determine the
identity of DNA target sequences. Because combinatorial
strategies are used, the number of compounds increases
exponentially while the number of chemical coupling cycles
increases only linearly. For example, synthesizing the
complete set of 4^8 (65,536) octanucleotides will add only four
hours to the synthesis for the 16 additional cycles.
Furthermore, combinatorial synthesis strategies can be
implemented to generate arrays of any desired composition.
For example, because the entire set of dodecamers (4^12) can be
produced in 48 photolysis and coupling cycles (b^n compounds
require b x n cycles), any subset of the dodecamers
(including any subset of shorter oligonucleotides) can be
constructed with the correct lithographic mask design in 48 or
fewer chemical coupling steps. In addition, the number of
compounds in an array is limited only by the density of
synthesis sites and the overall array size. Recent
experiments have demonstrated hybridization to probes
synthesized in 25 µm sites. At this resolution, the entire
set of 65,536 octanucleotides can be placed in an array
measuring 0.64 cm square, and the set of 1,048,576
dodecanucleotides requires only a 2.56 cm array.
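
The array dimensions quoted above follow from the site size. The sketch below assumes a square array with one probe per 25 µm site and no spacing between sites; the function name is hypothetical. Note that 1,048,576 equals 4^10, so the 2.56 cm figure corresponds to that number of sites, while the full set of 4^12 sequences would need a proportionally larger array at the same resolution.

```python
import math


def array_side_cm(num_probes, site_um=25.0):
    """Side length (cm) of a square array holding num_probes sites."""
    sites_per_side = math.isqrt(num_probes)
    if sites_per_side ** 2 != num_probes:
        sites_per_side += 1  # round up when the count is not a perfect square
    return sites_per_side * site_um / 10_000  # 10,000 µm per cm


for n_probes in (4 ** 8, 4 ** 10, 4 ** 12):
    print(n_probes, array_side_cm(n_probes))
```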

Genome sequencing projects will ultimately be limited by
DNA sequencing technologies. Current sequencing methodologies
are highly reliant on complex procedures and require
substantial manual effort. Sequencing by hybridization has
the potential for transforming many of the manual efforts into
more efficient and automated formats. Light-directed
synthesis is an efficient means for large scale production of
miniaturized arrays for SBH. The oligonucleotide arrays are
not limited to primary sequencing applications. Because
single base changes cause multiple changes in the
hybridization pattern, the oligonucleotide arrays provide a
powerful means to check the accuracy of previously elucidated
DNA sequence, or to scan for changes within a sequence. In
the case of octanucleotides, a single base change in the
target DNA results in the loss of eight complements, and
generates eight new complements. Matching of hybridization
patterns may be useful in resolving sequencing ambiguities
from standard gel techniques, or for rapidly detecting DNA
mutational events. The potentially very high information
content of light-directed oligonucleotide arrays will change
genetic diagnostic testing. Sequence comparisons of hundreds
to thousands of different genes will be assayed simultaneously
instead of the current one, or few at a time format. Custom
arrays can also be constructed to contain genetic markers for
the rapid identification of a wide variety of pathogenic
organisms.
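
The effect of a single base change on the octanucleotide hybridization pattern can be checked with a small sketch. The 24-mer below is a hypothetical sequence chosen only for illustration (its first twelve bases are the example target used earlier), and the function names are likewise assumptions. The count of eight applies when the changed base lies at least seven positions from either end of the target and no octamer is repeated.

```python
def octamer_complements(target, k=8):
    """Set of length-k probes perfectly complementary to the target."""
    comp = target.translate(str.maketrans("ACGT", "TGCA"))
    return {comp[i:i + k] for i in range(len(comp) - k + 1)}


def compare_patterns(reference, variant, k=8):
    """Return (probes lost, probes gained) when going from reference to variant."""
    ref = octamer_complements(reference, k)
    var = octamer_complements(variant, k)
    return ref - var, var - ref


reference = "AGCCTAGCTGAATCGTACGGATCC"  # hypothetical 24-mer target
variant = reference[:12] + "G" + reference[13:]  # T -> G at index 12

lost, gained = compare_patterns(reference, variant)
print(len(lost), len(gained))
```

Eight octamer windows span any interior position, so eight probes stop hybridizing and eight different probes start, which gives a redundant signature for each substitution.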

Oligonucleotide arrays can also be applied to study the
sequence specificity of RNA or protein-DNA interactions.
Experiments can be designed to elucidate specificity rules of
non Watson-Crick oligonucleotide structures or to investigate
the use of novel synthetic nucleoside analogs for antisense or
triple helix applications. Suitably protected RNA monomers
may be employed for RNA synthesis. The oligonucleotide arrays
should find broad application deducing the thermodynamic and
kinetic rules governing formation and stability of
oligonucleotide complexes.

Other than the use of photoremovable protecting groups,
the nucleoside coupling chemistry is very similar to that used
routinely today for oligonucleotide synthesis. Fig. 48 shows
the deprotection, coupling, and oxidation steps of a solid
phase DNA synthesis method. Fig. 49 shows an illustrative
synthesis route for the nucleoside building blocks used in the
method. Fig. 50 shows a preferred photoremovable protecting
group, MeNPOC, and how to prepare the group in active form.
The procedures described below show how to prepare these
reagents. The nucleoside building blocks are
5'-MeNPOC-THYMIDINE-3'-OCEP; 5'-MeNPOC-N4-t-BUTYL
PHENOXYACETYL-DEOXYCYTIDINE-3'-OCEP; 5'-MeNPOC-N2-t-BUTYL
PHENOXYACETYL-DEOXYGUANOSINE-3'-OCEP; and 5'-MeNPOC-N6-t-BUTYL
PHENOXYACETYL-DEOXYADENOSINE-3'-OCEP.

1. Preparation of 4,5-methylenedioxy-2-nitroacetophenone

A solution of 50 g (0.305 mole) 3,4-methylenedioxy-
acetophenone (Aldrich) in 200 mL glacial acetic acid was added
dropwise over 30 minutes to 700 mL of cold (2-4°C) 70% HNO3
with stirring (NOTE: the reaction will overheat without
external cooling from an ice bath, which can be dangerous and
lead to side products. At temperatures below 0°C, however,
the reaction can be sluggish. A temperature of 3-5°C seems to
be optimal). The mixture was left stirring for another 60
minutes at 3-5°C, and then allowed to approach ambient
temperature. Analysis by TLC (25% EtOAc in hexane) indicated
complete conversion of the starting material within 1-2 hr.
When the reaction was complete, the mixture was poured into ~3
liters of crushed ice, and the resulting yellow solid was
filtered off, washed with water and then suction-dried. Yield
~53 g (84%), used without further purification.

2. Preparation of 1-(4,5-Methylenedioxy-2-nitrophenyl)
ethanol

Sodium borohydride (10 g; 0.27 mol) was added slowly to a cold,
stirring suspension of 53 g (0.25 mol) of
4,5-methylenedioxy-2-nitroacetophenone in 400 mL methanol.
The temperature was kept below 10°C by slow addition of the
NaBH4 and external cooling with an ice bath. Stirring was
continued at ambient temperature for another two hours, at
which time TLC (CH2Cl2) indicated complete conversion of the
ketone. The mixture was poured into one liter of ice-water
and the resulting suspension was neutralized with ammonium
chloride and then extracted three times with 400 mL CH2Cl2 or
EtOAc (the product can be collected by filtration and washed
at this point, but it is somewhat soluble in water and this
results in a yield of only ~60%). The combined organic
extracts were washed with brine, then dried with MgSO4 and
evaporated. The crude product was purified from the main
byproduct by dissolving it in a minimum volume of CH2Cl2 or
THF (~175 mL) and then precipitating it by slowly adding hexane
(1000 mL) while stirring (yield 51 g; 80% overall). It can
also be recrystallized (e.g., toluene-hexane), but this
reduces the yield.

4. Synthesis of 5'-MeNPOC-2'-deoxynucleoside-3'-
(N,N-diisopropyl 2-cyanoethyl phosphoramidites)
(a.) 5'-MeNPOC-Nucleosides



[Reaction scheme: base-protected 2'-deoxynucleoside (5'-HO,
3'-HO) treated with MeNPOC-Cl in pyridine to give the
5'-MeNPOC-O nucleoside (3'-HO).]



Base = THYMIDINE (T); N-4-ISOBUTYRYL 2'-DEOXYCYTIDINE (ibu-dC);
N-2-PHENOXYACETYL 2'-DEOXYGUANOSINE (PAC-dG); and
N-6-PHENOXYACETYL 2'-DEOXYADENOSINE (PAC-dA)

All four of the 5'-MeNPOC nucleosides were prepared from the
base-protected 2'-deoxynucleosides by the following procedure.
The protected 2'-deoxynucleoside (90 mmole) was dried by
co-evaporating twice with 250 mL anhydrous pyridine. The
nucleoside was then dissolved in 300 mL anhydrous pyridine (or
1:1 pyridine/DMF, for the dG(PAC) nucleoside) under argon and
cooled to ~2°C in an ice bath. A solution of 24.6 g (90
mmole) MeNPOC-Cl in 100 mL dry THF was then added with
stirring over 30 minutes. The ice bath was removed, and the
solution allowed to stir overnight at room temperature (TLC:
5-10% MeOH in CH2Cl2; two diastereomers). After evaporating
the solvents under vacuum, the crude material was taken up in
250 mL ethyl acetate and extracted with saturated aqueous
NaHCO3 and brine. The organic phase was then dried over
Na2SO4, filtered and evaporated to obtain a yellow foam. The
crude products were finally purified by flash chromatography
(9 x 30 cm silica gel column eluted with a stepped gradient of
2% - 6% MeOH in CH2Cl2). Yields of the purified diastereomeric
mixtures are in the range of 65-75%.

(b.) 5'-MeNPOC-2'-deoxynucleoside-3'-(N,N-diisopropyl
2-cyanoethyl phosphoramidites)

The four deoxynucleosides were phosphitylated using either
2-cyanoethyl-N,N-diisopropyl chlorophosphoramidite, or
2-cyanoethyl-N,N,N',N'-tetraisopropylphosphorodiamidite. The
following is a typical procedure. Add 16.6 g (17.4 mL; 55
mmole) of 2-cyanoethyl-N,N,N',N'-tetraisopropylphosphoro-
diamidite to a solution of 50 mmole 5'-MeNPOC-nucleoside and
4.3 g (25 mmole) diisopropylammonium tetrazolide in 250 mL dry
CH2Cl2 under argon at ambient temperature. Continue stirring
for 4-16 hours (reaction monitored by TLC: 45:45:10
hexane/CH2Cl2/Et3N). Wash the organic phase with saturated
aqueous NaHCO3 and brine, then dry over Na2SO4, and evaporate
to dryness. Purify the crude amidite by flash chromatography
(9 x 25 cm silica gel column eluted with hexane/CH2Cl2/TEA,
45:45:10 for A, C, T; or 0:90:10 for G). The yield of
purified amidite is about 90%.

B. PREPARATION OF LABELED DNA / HYBRIDIZATION TO ARRAY

1. PCR

PCR amplification reactions are typically conducted in a
mixture composed of, per reaction: 1 µl genomic DNA; 10 µl
each primer (10 pmol/µl stocks); 10 µl 10x PCR buffer (100 mM
Tris.Cl pH 8.5, 500 mM KCl, 15 mM MgCl2); 10 µl 2 mM dNTPs
(made from 100 mM dNTP stocks); 2.5 U Taq polymerase (Perkin
Elmer AmpliTaq™, 5 U/µl); and H2O to 100 µl. The cycling
conditions are usually 40 cycles (94°C 45 sec, 55°C 30 sec,
72°C 60 sec) but may need to be varied considerably from
sample type to sample type. These conditions are for 0.2 mL
thin wall tubes in a Perkin Elmer 9600 thermocycler. See
Perkin Elmer 1992/93 catalogue for 9600 cycle time
information. Target, primer length and sequence composition,
among other factors, may also affect parameters.

For products in the 200 to 1000 bp size range, check 2 µl
of the reaction on a 1.5% 0.5x TBE agarose gel using an
appropriate size standard (phiX174 cut with HaeIII is
convenient). The PCR reaction should yield several picomoles
of product. It is helpful to include a negative control
(i.e., 1 µl TE instead of genomic DNA) to check for possible
contamination. To avoid contamination, keep PCR products from
previous experiments away from later reactions, using filter
tips as appropriate. Using a set of working solutions and
storing master solutions separately is helpful, so long as one
does not contaminate the master stock solutions.

For simple amplifications of short fragments from genomic
DNA it is, in general, unnecessary to optimize Mg2+
concentrations. A good procedure is the following: make a
master mix minus enzyme; dispense the genomic DNA samples to
individual tubes or reaction wells; add enzyme to the master
mix; and mix and dispense the master solution to each well,
using a new filter tip each time.
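
Master mix volumes for the 100 µl reaction described in this section can be tabulated with a short sketch. The per-reaction volumes are taken from the recipe above; the function name, the assumption of exactly two primers, and the 10% pipetting overage are illustrative choices, not part of the protocol.

```python
TOTAL_UL = 100.0        # final reaction volume
GENOMIC_DNA_UL = 1.0    # dispensed separately into each tube or well
TAQ_UL = 2.5 / 5.0      # 2.5 U from a 5 U/µl stock, added to the mix last
REAGENTS_UL = {
    "primer 1 (10 pmol/µl)": 10.0,
    "primer 2 (10 pmol/µl)": 10.0,
    "10x PCR buffer": 10.0,
    "2 mM dNTPs": 10.0,
}


def master_mix(n_reactions, overage=0.1):
    """Return (mix without enzyme, enzyme volume), both in µl.

    overage is a hypothetical allowance for pipetting losses.
    """
    scale = n_reactions * (1.0 + overage)
    water = TOTAL_UL - GENOMIC_DNA_UL - TAQ_UL - sum(REAGENTS_UL.values())
    mix = {name: volume * scale for name, volume in REAGENTS_UL.items()}
    mix["H2O"] = water * scale
    return mix, TAQ_UL * scale


mix, taq_ul = master_mix(12)
for name, volume in mix.items():
    print(f"{name}: {volume:.1f} µl")
print(f"Taq polymerase (add last): {taq_ul:.1f} µl")
```

After the enzyme is added, 99 µl of the complete mix is dispensed onto each 1 µl genomic DNA sample.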

2. PURIFICATION

Removal of unincorporated nucleotides and primers from
PCR samples can be accomplished using the Promega Magic PCR
Preps DNA purification kit. One can purify the whole sample,
following the instructions supplied with the kit (proceed from
section IIIB, 'Sample preparation for direct purification from
PCR reactions'). After elution of the PCR product in 50 µl of
TE or H2O, one centrifuges the eluate for 20 sec at 12,000 rpm
in a microfuge and carefully transfers 45 µl to a new
microfuge tube, avoiding any visible pellet. Resin is
sometimes carried over during the elution step. This transfer
prevents accidental contamination of the linear amplification
reaction with 'Magic PCR' resin. Other methods, e.g., size
exclusion chromatography, may also be used.

3. Linear amplification

In a 0.2 mL thin-wall PCR tube mix: 4 µl purified PCR
product; 2 µl primer (10 pmol/µl); 4 µl 10x PCR buffer; 4 µl
dNTPs (2 mM dA, dC, dG, 0.1 mM dT); 4 µl 0.1 mM dUTP; 1 µl 1
mM fluorescein dUTP (Amersham RPN 2121); 1 U Taq polymerase
(Perkin Elmer, 5 U/µl); and add H2O to 40 µl. Conduct 40
cycles (92°C 30 sec, 55°C 30 sec, 72°C 90 sec) of PCR. These
conditions have been used to amplify a 300 nucleotide
mitochondrial DNA fragment but are applicable to other
fragments. Even in the absence of a visible product band on
an agarose gel, there should still be enough product to give
an easily detectable hybridization signal. If one is not
treating the DNA with uracil DNA glycosylase (see Section 4),
dUTP can be omitted from the reaction.

4. Fragmentation 

Purify the linear amplification product using the Promega
Magic PCR Preps DNA purification kit, as per Section 2 above.
In a 0.2 mL thin-wall PCR tube mix: 40 µl purified labeled
DNA; 4 µl 10x PCR buffer; and 0.5 µl uracil DNA glycosylase
(BRL, 1 U/µl). Incubate the mixture 15 min at 37°C, then 10 min
at 97°C; store at -20°C until ready to use.

5. Hybridization, Scanning & Stripping

A blank scan of the slide in hybridization buffer only is
helpful to check that the slide is ready for use. The buffer
is removed from the flow cell and replaced with 1 mL of
(fragmented) DNA in hybridization buffer and mixed well. The
scan is performed in the presence of the labeled target. Fig.
51 shows an illustrative detection system for scanning a
DNA chip. A series of scans at 30 min intervals using a
hybridization temperature of 25°C yields a very clear signal,
usually within 30 min to two hours, but it may be
desirable to hybridize longer, e.g., overnight. Using a laser
power of 50 µW and 50 µm pixels, one should obtain maximum
counts in the range of hundreds to low thousands/pixel for a
new slide. When finished, the slide can be stripped using 50%
formamide, rinsing well in deionized H2O, blowing dry, and
storing at room temperature.

C. PREPARATION OF LABELED RNA / HYBRIDIZATION TO ARRAY

1. Tagged primers

The primers used to amplify the target nucleic acid
should have promoter sequences if one desires to produce RNA
from the amplified nucleic acid. Suitable promoter sequences
are shown below and include:
(1) the T3 promoter sequence:
    5'-CGGAATTAACCCTCACTAAAGG
    5'-AATTAACCCTCACTAAAGGGAG;
(2) the T7 promoter sequence:
    5'-TAATACGACTCACTATAGGGAG;
and (3) the SP6 promoter sequence:
    5'-ATTTAGGTGACACTATAGAA.



The desired promoter sequence is added to the 5' end of the
PCR primer. It is convenient to add a different promoter to
each primer of a PCR primer pair so that either strand may be
transcribed from a single PCR product.
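
Building a tagged primer pair amounts to prepending a promoter to each gene-specific primer. In the sketch below the promoter sequences are those listed in this section (for T3, the second of the two sequences given); the two 20-nucleotide gene-specific primers and the function name are hypothetical placeholders.

```python
PROMOTERS = {
    "T3": "AATTAACCCTCACTAAAGGGAG",
    "T7": "TAATACGACTCACTATAGGGAG",
    "SP6": "ATTTAGGTGACACTATAGAA",
}


def tag_primer(promoter, primer):
    """Prepend the named promoter to the 5' end of a primer (5'->3' strings)."""
    return PROMOTERS[promoter] + primer.upper()


# Hypothetical gene-specific primers, 20 nucleotides each.
forward = tag_primer("T3", "ACGTTGCAAGCTTGGATCCA")
reverse = tag_primer("T7", "TGCATGCCTGCAGGTCGACT")

print(forward, len(forward))
print(reverse, len(reverse))
```

Using a different promoter on each primer lets either strand be transcribed from the same PCR product by choosing the matching RNA polymerase.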

Synthesize PCR primers so as to leave the DMT group on.
DMT-on purification is unnecessary for PCR but appears to be
important for transcription. Add 25 µl 0.5 M NaOH to the
collection vial prior to collection of oligonucleotide to keep
the DMT group on. Deprotect using standard chemistry; 55°C
overnight is convenient.

HPLC purification is accomplished by drying down the
oligonucleotides, resuspending in 1 mL 0.1 M TEAA (dilute 2.0
M stock in deionized water, filter through 0.2 micron filter)
and filtering through a 0.2 micron filter. Load 0.5 mL on reverse
phase HPLC (column can be a Hamilton PRP-1 semi-prep, #79426).
The gradient is 0 -> 50% CH3CN over 25 min (program: 0.2
µmol prep, 0-50, 25 min). Pool the desired fractions, dry down,
resuspend in 200 µl 80% HAc, 30 min RT. Add 200 µl EtOH; dry
down. Resuspend in 200 µl H2O, plus 20 µl NaAc pH 5.5, 600 µl
EtOH. Leave 10 min on ice; centrifuge 12,000 rpm for 10 min
in microfuge. Pour off supernatant. Rinse pellet with 1 mL
EtOH, dry, resuspend in 200 µl H2O. Dry, resuspend in 200 µl
TE. Measure A260, prepare a 10 pmol/µl solution in TE (10 mM
Tris.Cl pH 8.0, 0.1 mM EDTA). Following HPLC purification of
a 42 mer, a yield in the vicinity of 15 nmol from a 0.2 µmol
scale synthesis is typical.

2. Genomic DNA Preparation 

Add 500 µl (10 mM Tris.Cl pH 8.0, 10 mM EDTA, 100 mM
NaCl, 2% (w/v) SDS, 40 mM DTT, filter sterilized) to the
sample. Add 1.25 µl 20 mg/ml proteinase K (Boehringer).
Incubate at 55°C for 2 hours, vortexing once or twice.
Perform 2x 0.5 mL 1:1 phenol:CHCl3 extractions. After each
extraction, centrifuge 12,000 rpm 5 min in a microfuge and
recover 0.4 mL supernatant. Add 35 µl NaAc pH 5.2 plus 1 mL
EtOH. Place sample on ice 45 min; then centrifuge 12,000 rpm
30 min, rinse, air dry 30 min, and resuspend in 100 µl TE.



3. PCR

PCR is performed in a mixture containing, per reaction:
1 µl genomic DNA; 4 µl each primer (10 pmol/µl stocks); 4 µl
10x PCR buffer (100 mM Tris.Cl pH 8.5, 500 mM KCl, 15 mM
MgCl2); 4 µl 2 mM dNTPs (made from 100 mM dNTP stocks); 1 U
Taq polymerase (Perkin Elmer, 5 U/µl); H2O to 40 µl. About 40
cycles (94°C 30 sec, 55°C 30 sec, 72°C 30 sec) are performed,
but cycling conditions may need to be varied. These conditions
are for 0.2 mL thin wall tubes in a Perkin Elmer 9600. For
products in the 200 to 1000 bp size range, check 2 µl of the
reaction on a 1.5% 0.5x TBE agarose gel using an appropriate
size standard. For larger or smaller volumes (20 - 100 µl),
one can use the same amount of genomic DNA but adjust the
other ingredients accordingly.

4. In vitro transcription 

Mix: 3 µl PCR product; 4 µl 5x buffer; 2 µl DTT; 2.4 µl
10 mM rNTPs (100 mM solutions from Pharmacia); 0.48 µl 10 mM
fluorescein-UTP (Fluorescein-12-UTP, 10 mM solution, from
Boehringer Mannheim); 0.5 µl RNA polymerase (Promega T3 or T7
RNA polymerase); and bring the volume to 20 µl. Incubate at
37°C for 3 h. Check 2 µl of the reaction on a 1.5% 0.5x TBE
agarose gel using a size standard. 5x buffer is 200 mM Tris
pH 7.5, 30 mM MgCl2, 10 mM spermidine, 50 mM NaCl, and 100 mM
DTT (supplied with enzyme). The PCR product needs no
purification and can be added directly to the transcription
mixture. A 20 µl reaction is suggested for an initial test
experiment and hybridization; a 100 µl reaction is considered
"preparative" scale (the reaction can be scaled up to obtain
more target). The amount of PCR product to add is variable;
typically a PCR reaction will yield several picomoles of DNA.
If the PCR reaction does not produce that much target, then
one should increase the amount of DNA added to the
transcription reaction (as well as optimize the PCR). The
ratio of fluorescein-UTP to UTP suggested above is 1:5, but
ratios from 1:3 to 1:10 all work well. One can also label
with biotin-UTP and detect with streptavidin-FITC to obtain
results similar to those with fluorescein-UTP detection.

For nondenaturing agarose gel electrophoresis of RNA,
note that the RNA band will normally migrate somewhat faster
than the DNA template band, although sometimes the two bands
will comigrate. The temperature of the gel can affect the
migration of the RNA band. The RNA produced from in vitro
transcription is quite stable and can be stored for months (at
least) at -20°C without any evidence of degradation. It can
be stored in unsterilized 6x SSPE, 0.1% Triton X-100 at -20°C
for days (at least) and reused twice (at least) for
hybridization, without taking any special precautions in
preparation or during use. RNase contamination should of
course be avoided. When extracting RNA from cells, it is
preferable to work very rapidly and to use strongly denaturing
conditions. Avoid using glassware previously contaminated
with RNases. Use of new disposable plasticware (not
necessarily sterilized) is preferred, as new plastic tubes,
tips, etc., are essentially RNase free. Treatment with DEPC
or autoclaving is typically not necessary.

5. Fragmentation

Heat transcription mixture at 94°C for forty min.
The extent of fragmentation is controlled by varying Mg2+
concentration (30 mM is typical), temperature, and duration of
heating.

6. Hybridization, Scanning, & Stripping

A blank scan of the slide in hybridization buffer only is
helpful to check that the slide is ready for use. The buffer
is removed from the flow cell and replaced with 1 mL of
(hydrolysed) RNA in hybridization buffer and mixed well.
Incubate for 15 - 30 min at 18°C. Remove the hybridization
solution, which can be saved for subsequent experiments.
Rinse the flow cell 4-5 times with fresh changes of 6x SSPE
/ 0.1% Triton X-100, equilibrated to 18°C. The rinses can be
performed rapidly, but it is important to empty the flow cell
before each new rinse and to mix the liquid in the cell
thoroughly. A series of scans at 30 min intervals using a
hybridization temperature of 25°C yields a very clear signal,
usually within 30 min to two hours, but it may be
desirable to hybridize longer, e.g., overnight. Using a laser
power of 50 µW and 50 µm pixels, one should obtain maximum
counts in the range of hundreds to low thousands/pixel for a
new slide. When finished, the slide can be stripped using
warm water.

These conditions are illustrative and assume a probe
length of ~15 nucleotides. The stripping conditions suggested
are fairly severe, but some signal may remain on the slide if
the washing is not stringent. Nevertheless, the counts
remaining after the wash should be very low in comparison to
the signal in presence of target RNA. In some cases, much
gentler stripping conditions are effective. The lower the
hybridization temperature and the longer the duration of
hybridization, the more difficult it is to strip the slide.
Longer targets may be more difficult to strip than shorter
targets.

7. Amplification of Signal 

A variety of methods can be used to enhance detection of
labelled targets bound to a probe on the array. In one
embodiment, the protein MutS (from E. coli) or equivalent
proteins such as yeast MSH1, MSH2, and MSH3; mouse Rep-3; and
Streptococcus Hex-A, is used in conjunction with target
hybridization to detect probe-target complexes that contain
mismatched base pairs. The protein, labeled directly or
indirectly, can be added to the chip during or after
hybridization of target nucleic acid, and differentially binds
to homo- and heteroduplex nucleic acid. A wide variety of
dyes and other labels can be used for similar purposes. For
instance, the dye YOYO-1 is known to bind preferentially to
nucleic acids containing sequences comprising runs of 3 or
more G residues.



8. Detection of Repeat Sequences 

In some circumstances, e.g., target nucleic acids with
repeated sequences or with high G/C content, very long probes
are sometimes required for optimal detection. In one
embodiment for detecting specific sequences in a target
nucleic acid with a DNA chip, repeat sequences are detected as
follows. The chip comprises probes of length sufficient to
extend into the repeat region varying distances from each end.
The sample, prior to hybridization, is treated with a labelled
oligonucleotide that is complementary to a repeat region but
shorter than the full length of the repeat. The target
nucleic acid is labelled with a second, distinct label. After
hybridization, the chip is scanned for probes that have bound
both the labelled target and the labelled oligonucleotide
probe; the presence of such bound probes shows that at least
two repeat sequences are present.

While the foregoing invention has been described in some
detail for purposes of clarity and understanding, it will be
clear to one skilled in the art from a reading of this
disclosure that various changes in form and detail can be made
without departing from the true scope of the invention. All
publications and patent documents cited in this application
are incorporated by reference in their entirety for all
purposes to the same extent as if each individual publication
or patent document were so individually denoted.

WHAT IS CLAIMED IS:
General tiling claims

1. An array of oligonucleotide probes immobilized on a
solid support, the array comprising at least two sets of
oligonucleotide probes,
    (1) a first probe set comprising a plurality of
probes, each probe comprising a segment of at least three
nucleotides exactly complementary to a subsequence of the
reference sequence, the segment including at least one
interrogation position complementary to a corresponding
nucleotide in the reference sequence,
    (2) a second probe set comprising a corresponding
probe for each probe in the first probe set, the corresponding
probe in the second probe set being identical to a sequence
comprising the corresponding probe from the first probe set or
a subsequence of at least three nucleotides thereof that
includes the at least one interrogation position, except that
the at least one interrogation position is occupied by a
different nucleotide in each of the two corresponding probes
from the first and second probe sets;
wherein the probes in the first probe set have at least
two interrogation positions respectively corresponding to each
of two contiguous nucleotides in the reference sequence.

1 2. An array of oligonucleotide probes immobilized on a 

2 solid support, the array comprising at least four sets of 

3 oligonucleotide probes, 

4 (1) a first probe set comprising a plurality of 

5 probes, each probe comprising a segment of at least three 

6 nucleotides exactly complementary to a subsequence of the 

7 reference sequence, the segment including at least one 

8 interrogation position complementary to a corresponding 

9 nucleotide in the reference sequence, 

10 (2) second, third and fourth probe sets, each 

11 comprising a corresponding probe for each probe in the first 

12 probe set, the probes in the second, third and fourth probe 

13 sets being identical to a sequence comprising the 

14 corresponding probe from the first probe set or a subsequence 
15 of at least three nucleotides thereof that includes the at 

16 least one interrogation position, except that the at least one 

17 interrogation position is occupied by a different nucleotide 

18 in each of the four corresponding probes from the four probe 

19 sets. 
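
By way of illustration only, the four probe sets recited in claim 2 can be sketched in Python. The sketch assumes a DNA reference sequence and writes each probe as the base-by-base complement of the reference in the same left-to-right order, ignoring strand polarity. The function name `tile_four_sets`, the parameter names and the default values are hypothetical and are not taken from the claims.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq):
    """Base-by-base complement, written in the same order as the input."""
    return "".join(COMPLEMENT[base] for base in seq)


def tile_four_sets(reference, probe_len=15, interrogation=7):
    """Build one column of four probes per window of the reference.

    Each column holds the exactly complementary probe plus three probes
    that differ from it only at the interrogation position (a 0-based
    index within the probe; an illustrative choice).
    """
    columns = []
    for start in range(len(reference) - probe_len + 1):
        wildtype = complement(reference[start:start + probe_len])
        probes = {
            base: wildtype[:interrogation] + base + wildtype[interrogation + 1:]
            for base in "ACGT"
        }
        columns.append({
            "ref_pos": start + interrogation,  # reference nucleotide interrogated
            "wildtype_base": wildtype[interrogation],
            "probes": probes,
        })
    return columns


if __name__ == "__main__":
    for column in tile_four_sets("ACGTACGTAC", probe_len=5, interrogation=2):
        print(column["ref_pos"], column["probes"])
```

In this sketch the probe keyed by `wildtype_base` plays the role of the first probe set, and the other three probes in each column play the roles of the second, third and fourth probe sets.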

1 3. The oligonucleotide array of claim 2, further 

2 comprising a fifth probe set comprising a corresponding probe 

3 for each probe in the first probe set, the corresponding probe 

4 from the fifth probe set being identical to a sequence 

5 comprising the corresponding probe from the first probe set or 

6 a subsequence of at least three nucleotides thereof that 

7 includes the at least one interrogation position, except that 

8 the at least one interrogation position is deleted in the 

9 corresponding probe from the fifth probe set. 

1 4. The oligonucleotide array of claim 2, further 

2 comprising a sixth probe set comprising a corresponding probe 

3 for each probe in the first probe set, the corresponding probe 

4 from the sixth probe set being identical to a sequence 

5 comprising the corresponding probe from the first probe set or 

6 a subsequence of at least three nucleotides thereof that 

7 includes the at least one interrogation position, except that 

8 an additional nucleotide is inserted adjacent to the at least 

9 one interrogation position in the corresponding probe from the 
10 first probe set. 
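
The deletion and insertion probe sets of claims 3 and 4 might be derived from a first-set probe roughly as follows. This is a minimal sketch: the function names are hypothetical, the interrogation position is a 0-based index, and generating one insertion probe per possible base is an assumption of the sketch, since claim 4 recites only a single additional nucleotide.

```python
def deletion_probe(wildtype, interrogation):
    """Probe with the nucleotide at the interrogation position removed."""
    return wildtype[:interrogation] + wildtype[interrogation + 1:]


def insertion_probes(wildtype, interrogation):
    """Probes with one extra nucleotide inserted just after the interrogation position.

    One probe is returned for each possible inserted base (an assumption
    of this sketch, not a requirement of the claim).
    """
    cut = interrogation + 1
    return {base: wildtype[:cut] + base + wildtype[cut:] for base in "ACGT"}


if __name__ == "__main__":
    print(deletion_probe("TGCAT", 2))
    print(insertion_probes("TGCAT", 2))
```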

1 5. The array of claim 2, wherein the first probe set has 

2 at least three interrogation positions respectively 

3 corresponding to each of three contiguous nucleotides in a 

4 reference sequence. 

1 6. The array of claim 2, wherein the first probe set has 

2 at least 50 interrogation positions respectively corresponding 

3 to each of 50 contiguous nucleotides in a reference sequence. 



1 7. The array of claim 1 or 2, wherein the first probe 

2 set has at least 100 interrogation positions respectively 

3 corresponding to each of 100 contiguous nucleotides in a 

4 reference sequence. 



1 8. The oligonucleotide array of claim 1 or 2, wherein 

2 the first probe set has an interrogation position 

3 corresponding to each of at least 30% of the nucleotides in a 

4 reference sequence and the reference sequence comprises at 

5 least 100 nucleotides. 

1 9. The oligonucleotide array of claim 8, wherein the 

2 first probe set comprises probes which completely span the 

3 reference sequence, which probes relative to the reference 

4 sequence, overlap one another in sequence. 

1 10. The oligonucleotide array of claim 9, wherein the 

2 first probe set has an interrogation position corresponding to 

3 each of the nucleotides in the reference sequence. 

1 11. The oligonucleotide array of claim 10, wherein the 

2 probes are oligodeoxyribonucleotides. 

1 12. The oligonucleotide array of claim 1 or 2, wherein 

2 the array comprises between 100 and 10,000 probes. 

1 13. The oligonucleotide array of claim 1 or 2, wherein 

2 the array comprises between 10,000 and 100,000 probes. 

1 14. The oligonucleotide array of claim 1 or 2, wherein 

2 the array comprises between 100,000 and 10,000,000 probes. 

1 15. The oligonucleotide array of claim 1 or 2, wherein 

2 the probes are linked to the support via a spacer. 



1 16. The oligonucleotide array of claim 1 or 2, wherein 

2 the segment in each probe of the first probe set that is 

3 exactly complementary to the subsequence of the reference 

4 sequence is 9-21 nucleotides. 

1 17. The oligonucleotide array of claim 16, wherein the 

2 segment is n nucleotides long, and the subsequence is at least 

3 n-2 nucleotides long. 

1 18. The oligonucleotide array of claim 1 or 2, wherein 

2 each probe of the first probe set consists of the segment that 

3 is exactly complementary to the subsequence of the reference 

4 sequence. 

1 19. The oligonucleotide array of claim 1 or 2, wherein 

2 the probes in the second, third and fourth probe sets are 

3 identical to the corresponding probe from the first probe set 

4 except that the at least one interrogation position is 

5 occupied by a different nucleotide in each of the four 

6 corresponding probes from the four probe sets. 

1 20. The array of claim 2, further comprising fifth, 

2 sixth and seventh probe sets, wherein: 

3 the segment of each probe in the first set 

4 includes at least two interrogation positions each 

5 corresponding to a nucleotide in the reference sequence, 

6 the second, third and fourth probe sets, each 

7 comprise a corresponding probe for each probe in the first 

8 probe set, the corresponding probes in the second, third and 

9 fourth probe sets being identical to a sequence comprising the 

10 corresponding probe from the first probe set or a subsequence 

11 of at least three nucleotides thereof that includes a first 

12 interrogation position except that the first interrogation 

13 position is occupied by a different nucleotide in each of the 

14 four corresponding probes from the four probe sets; 

15 the fifth, sixth and seventh probe sets, each 

16 comprising a corresponding probe for each probe in the first 

17 probe set, the probes in the fifth, sixth and seventh probe 

18 sets being identical to a sequence comprising the 

19 corresponding probe from the first probe set or a subsequence 

20 of at least three nucleotides thereof that includes a second 

21 interrogation position, except that the second interrogation 
22 position is occupied by a different nucleotide in each of the 

23 four corresponding probes from the four probe sets. 

1 21. The array of claim 2, wherein each probe in the 

2 first probe set further comprises a second segment of at least 

3 three nucleotides exactly complementary to a second 

4 subsequence of the reference sequence, and the probes from the 

5 second, third and fourth probe sets comprise the corresponding 

6 probe from the first probe set or a subsequence thereof 

7 comprising the first and second segments except in the at 

8 least one interrogation position. 

1 22. The array of claim 2, further comprising: 

2 a fifth probe set comprising at least one probe 

3 comprising a segment of at least seven nucleotides exactly 

4 complementary to a subsequence of the reference sequence 

5 except at one or two positions, the segment including at least 

6 one interrogation position corresponding to a nucleotide in 

7 the reference sequence not at the one or two positions; 

8 sixth, seventh and eighth probe sets, each comprising a 

9 probe for each probe in the fifth probe set, the corresponding 

10 probes from the sixth, seventh and eighth probe sets being 

11 identical to a sequence comprising the corresponding probe 

12 from the fifth probe set or a subsequence of at least nine 

13 nucleotides thereof including the at least one interrogation 

14 position and the one or two positions, except in the at least 

15 one interrogation position, which is occupied by a different 

16 nucleotide in each of the four probes. 

1 23. The array of claim 2, wherein the probes are 

2 arranged on the substrate so that the first set of probes is 
3 arranged in a row across the substrate in an order reflecting 

4 the overlap between the probes and the reference sequence, and 

5 the additional sets of probes are arranged in columns relative 

6 to the probes in said first set, so that probes with the same 

7 interrogation position are in the same column and so that each 

8 column comprises at least 4 probes. 



PCT/US94/12305 

WO 95/11995 

1 24. The array of claim 2, wherein said probes are 12 to 

2 17 nucleotides in length. 

1 25. The array of claim 2, wherein said probes are 15 

2 nucleotides in length and attached by a covalent linkage to a 

3 site on a 3'-end of said probes, and said interrogation 

4 position is located at position 7, relative to the 3'-end of 

5 said probes. 

1 26. The array of claim 2, further comprising fifth, 

2 sixth, seventh and eighth probe sets, 

3 (1) a fifth probe set comprising a plurality of 

4 probes, each probe comprising a segment of at least three 

5 nucleotides exactly complementary to a subsequence of a second 

6 reference sequence, the segment including at least one 

7 interrogation position complementary to a corresponding 

8 nucleotide in the reference sequence, 

9 (2) the sixth, seventh, and eighth probe sets, each 

10 comprising a corresponding probe for each probe in the fifth 

11 probe set, the probes in the sixth, seventh and eighth probe 

12 sets being identical to a sequence comprising the 

13 corresponding probe from the fifth probe set or a subsequence 

14 of at least three nucleotides thereof that includes the at 

15 least one interrogation position, except that the at least one 

16 interrogation position is occupied by a different nucleotide 

17 in each of the four corresponding probes from the fifth, 

18 sixth, seventh and eighth probe sets. 

1 27. The array of claim 22, wherein the first, second, 

2 third and fourth probe sets have probes of a first length and 

3 the fifth, sixth, seventh and eighth probe sets have probes of 

4 a second length different from the first length. 

Tiling for wildtype and mutant reference sequences 

1 28. An array of oligonucleotide probes immobilized on a 

2 solid support, the array comprising at least one pair of first 

3 and second probe groups, each group comprising a first and 

4 second sets of oligonucleotide probes as defined by claim 1; 

5 wherein each probe in the first probe set from the 

6 first group is exactly complementary to a subsequence of a 

7 first reference sequence and each probe in the first probe set 

8 from the second group is exactly complementary to a 

9 subsequence from a second reference sequence. 

1 29. The array of claim 28, wherein the second reference 

2 sequence is a mutated form of the first reference sequence. 

1 30. The array of claim 28, wherein each group further 

2 comprises third and fourth probe sets, each comprising a 

3 corresponding probe for each probe in the first probe set, the 

4 probes in the second, third and fourth probe sets being 

5 identical to a sequence comprising the corresponding probe 

6 from the first probe set or a subsequence of at least three 

7 nucleotides thereof that includes the interrogation position, 

8 except that the interrogation position is occupied by a 

9 different nucleotide in each of the four corresponding probes 
10 from the four probe sets. 

1 31. The array of claim 30 that comprises at least five 

2 pairs of first and second probe groups, wherein the probes in 

3 the first probe sets from the first groups of the five pairs 

4 are exactly complementary to subsequences from five different 

5 respective first reference sequences. 

1 32. The array of claim 30 that comprises at least forty 

2 pairs of first and second probe groups, wherein the probes in 

3 the first probe sets from the first groups of the forty pairs 

4 are exactly complementary to subsequences from forty 

5 respective first reference sequences. 

Block tiling 

1 33. An array of oligonucleotide probes immobilized on a 

2 solid support, the array comprising at least a group of probes 

3 comprising: 

4 a wildtype probe comprising a segment of at least three 

5 nucleotides exactly complementary to a subsequence of a 




6 reference sequence, the segment having at least first and 

7 second interrogation positions corresponding to first and 

8 second nucleotides in the reference sequence, 

9 a first set of three mutant probes, each identical to a 

10 sequence comprising the wildtype probe or a subsequence of at 

11 least three nucleotides thereof including the first and second 

12 interrogation positions, except in the first interrogation 

13 position, which is occupied by a different nucleotide in each 

14 of the three mutant probes and the wildtype probe; 

15 a second set of three mutant probes, each identical to a 

16 sequence comprising the wildtype probe or a subsequence of at 

17 least three nucleotides thereof including the first and second 

18 interrogation positions, except in the second interrogation 

19 position, which is occupied by a different nucleotide in each 

20 of the three mutant probes and the wildtype probe. 

1 34. The array of claim 33, wherein the segment of the 

2 wildtype probe comprises 3-20 interrogation positions 

3 corresponding to 3-20 respective nucleotides in the reference 

4 sequence, and the array comprises 3-20 respective sets of 

5 three mutant probes, each of the three probes identical to a 

6 sequence comprising the wildtype probe or a subsequence 

7 thereof including the 3-20 interrogation positions, except 

8 that one of the 3-20 interrogation positions is occupied by a 

9 different nucleotide in each of the three mutant probes and 

10 the wildtype probes, the one of the 3-20 interrogation 

11 positions being different in each of the 3-20 respective sets 

12 of three mutant probes. 
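
The block tiling of claims 33 and 34 can be pictured with a short Python sketch. It assumes that the mutant probes are the same length as the wildtype probe and that interrogation positions are given as 0-based indices; the function name `block_tiling` and the dictionary layout are hypothetical.

```python
def block_tiling(wildtype, positions):
    """Return one wildtype probe and a set of three mutant probes per position.

    Every mutant probe is identical to the wildtype probe except at a single
    interrogation position, where it carries one of the three other bases.
    """
    block = {"wildtype": wildtype, "mutants": {}}
    for pos in positions:
        block["mutants"][pos] = [
            wildtype[:pos] + base + wildtype[pos + 1:]
            for base in "ACGT"
            if base != wildtype[pos]
        ]
    return block


if __name__ == "__main__":
    block = block_tiling("TGCATGC", [2, 3, 4])
    print(block["wildtype"])
    for pos, mutants in block["mutants"].items():
        print(pos, mutants)
```

Because a single wildtype probe serves every interrogation position in the block, a block with k interrogation positions uses 1 + 3k probes in this sketch, rather than the 4k probes of a column-per-position layout.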

1 35. An array of probes immobilized to a solid support 

2 comprising two groups of probes, each group as defined by 

3 claim 33, a first group comprising a wildtype probe comprising 

4 a segment exactly complementary to a subsequence of a first 

5 reference sequence and a second group comprising a wildtype 

6 probe comprising a segment exactly complementary to a 

7 subsequence of a second reference sequence. 

1 36. The array of claim 35, comprising at least 10-100 

2 groups of probes, each comprising a wildtype probe comprising 

3 a segment exactly complementary to a subsequence of at least 

4 10-100 respective reference sequences. 

Pooled probes 

1 37. A method of comparing a target sequence with a 

2 reference sequence, the method comprising: 

3 identifying variants of a reference sequence differing 

4 from the reference sequence in at least one nucleotide; 

5 assigning each variant a designation, 

6 providing an array of pools of probes, each pool 

7 occupying a separate cell of the array, wherein each pool 

8 comprises a probe comprising a segment exactly complementary 

9 to each variant sequence assigned a particular designation, 

10 contacting the array with a target sequence comprising a 

11 variant of the reference sequence; 

12 determining the relative hybridization intensities of the 

13 pools in the array to the target sequence; 

14 determining the target sequence from the relative 

15 hybridization intensities of the pools. 

1 38. The method of claim 37, wherein the variants are 

2 assigned numbers according to an error code. 

1 39. The method of claim 37, wherein each variant is 

2 assigned a designation having at least one digit and at least 

3 one value for the digit, and each pool comprises a probe 

4 comprising a segment exactly complementary to each variant 

5 sequence assigned a particular value in a particular digit. 

1 40. The method of claim 39, wherein the variants are 

2 assigned successive numbers in a numbering system of base m 

3 having n digits, and the array comprises n x (m-1) pools of 

4 probes. 

1 41. The method of claim 40, wherein each pool further 

2 comprises a probe comprising a segment exactly complementary 

3 to the reference sequence. 
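
The pooling scheme of claims 39 and 40 can be sketched as follows. Several points are assumptions of this sketch rather than statements from the claims: variants are numbered from 1, number 0 is left for the unmodified reference, one pool is created for each digit position and each nonzero digit value (giving n x (m-1) pools), and decoding presumes a clean on/off signal from a single variant. All function names are hypothetical.

```python
def to_digits(number, base, n_digits):
    """Digits of `number` in the given base, least significant digit first."""
    digits = []
    for _ in range(n_digits):
        digits.append(number % base)
        number //= base
    return digits


def build_pools(variants, base, n_digits):
    """Assign each variant to the pools matching its nonzero digits.

    Pools are keyed by (digit position, digit value); there are
    n_digits * (base - 1) of them.
    """
    pools = {(d, v): [] for d in range(n_digits) for v in range(1, base)}
    for number, variant in enumerate(variants, start=1):
        for d, v in enumerate(to_digits(number, base, n_digits)):
            if v != 0:
                pools[(d, v)].append(variant)
    return pools


def decode(lit_pools, base):
    """Recover the variant number from the set of pools that hybridized."""
    return sum(value * base ** digit for digit, value in lit_pools)


if __name__ == "__main__":
    pools = build_pools(["v1", "v2", "v3"], base=2, n_digits=2)
    print(pools)
    print(decode({(0, 1), (1, 1)}, base=2))
```

With base 2 and two digits, three variants are resolved with only two pools; the pattern of pools that light up spells out the number assigned to the variant present in the target.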

Trellis tiling 

1 42. A pooled probe comprising a segment exactly 

2 complementary to a subsequence of a reference sequence except 

3 at a first interrogation position occupied by a pooled 

4 nucleotide N, a second interrogation position occupied by a 

5 pooled nucleotide selected from the group of three consisting 

6 of (1) M or K, (2) R or Y and (3) S or W, and a third 

7 interrogation position occupied by a second pooled nucleotide 

8 selected from the group, wherein the pooled nucleotide 

9 occupying the second interrogation position comprises a 

10 nucleotide complementary to a corresponding nucleotide from 

11 the reference sequence when the second pooled probe and 

12 reference sequence are maximally aligned, and the pooled 

13 nucleotide occupying the third interrogation position 

14 comprises a nucleotide complementary to a corresponding 

15 nucleotide from the reference sequence when the third pooled 

16 probe and the reference sequence are maximally aligned, 

17 wherein N is A, C, G or T(U), K is G or T(U), M is A or C, R 

18 is A or G, Y is C or T(U), W is A or T(U) and S is G or C. 
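
The pooled-nucleotide symbols defined in claim 42 can be expanded into the individual probe sequences that a pooled probe represents. The sketch below is illustrative only; it assumes DNA probes (T rather than U), and the table name `POOLED` and function name `expand_pooled_probe` are hypothetical.

```python
from itertools import product

# Pooled nucleotide symbols as defined in claim 42 (DNA alphabet assumed).
POOLED = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT",
    "K": "GT", "M": "AC",
    "R": "AG", "Y": "CT",
    "S": "GC", "W": "AT",
}


def expand_pooled_probe(pooled_probe):
    """List every individual sequence represented by a pooled probe."""
    choices = [POOLED[symbol] for symbol in pooled_probe]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    members = expand_pooled_probe("TGNMRCA")
    print(len(members))
    print(members[:4])
```

A pooled probe with one N position and two two-fold pooled positions therefore stands for 4 x 2 x 2 = 16 individual sequences occupying a single cell.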

1 43. An array of oligonucleotide probes immobilized on a 

2 solid support, the array comprising: 

3 first, second and third cells respectively occupied by 

4 first, second and third pooled probes, each pooled probe 

5 comprising a segment exactly complementary to a subsequence of 

6 a reference sequence except at a first interrogation position 

7 occupied by a pooled nucleotide N, a second interrogation 

8 position occupied by a pooled nucleotide selected from the 

9 group of three consisting of (1) M or K, (2) R or Y and (3) S 

10 or W, and a third interrogation position occupied by a second 

11 pooled nucleotide selected from the group, wherein the pooled 

12 nucleotide occupying the second interrogation position 

13 comprises a nucleotide complementary to a corresponding 

14 nucleotide from the reference sequence when the pooled probe 
15 and the reference sequence are maximally aligned, and the 

16 pooled nucleotide occupying the third interrogation position 

17 comprises a nucleotide complementary to a corresponding 

18 nucleotide from the reference sequence when the pooled probe 

19 and the reference sequence are maximally aligned; 

20 provided that one of the three interrogation 

21 positions in each of the three pooled probes is aligned 

22 with the same corresponding nucleotide in the reference 

23 sequence, this interrogation position being occupied by an N 

24 in one of the pooled probes, and a different pooled nucleotide 

25 in each of the other two pooled probes, 

26 wherein N is A, C, G or T(U), K is G or T(U), M is A 

27 or C, R is A or G, Y is C or T(U), W is A or T(U) and S is G 

28 or C. 

1 44. The array of claim 43 further comprising: 

2 fourth and fifth cells respectively occupied by fourth 

3 and fifth pooled probes, each pooled probe as defined by 

4 claim 43, 

5 wherein one of the three interrogation positions in the 

6 second, third and fourth pooled probes is aligned with the 

7 same corresponding nucleotide in the reference sequence, this 

8 interrogation position being occupied by an N in one of the 

9 pooled probes, and a different pooled nucleotide in each of 

10 the other two pooled probes, 

11 wherein one of the three interrogation positions in the 

12 third, fourth and fifth pooled probes is aligned with the same 

13 corresponding nucleotide in the reference sequence, this 

14 interrogation position being occupied by an N in one of the 

15 pooled probes, and a different pooled nucleotide in each of 

16 the other two pooled probes. 

1 45. The array of claim 44, wherein the pooled probes are 

2 identical except at the interrogation positions. 

1 46. The array of claim 44, wherein the first, second, 

2 third, fourth and fifth pooled probes are exactly 

3 complementary to five respective subsequences of the reference 

4 sequence that differ from each other by increments of one 

5 nucleotide. 



Bridge tiling 

1 47. An array of oligonucleotide probes immobilized on a 

2 solid support, the array comprising at least four probes: 

3 a first probe comprising first and second segments, each 

4 of at least three nucleotides and exactly complementary to 

5 first and second subsequences of a reference sequence, the 

6 segments including at least one interrogation position 

7 corresponding to a nucleotide in the reference sequence, 

8 wherein either (1) the first and second subsequences are 

9 noncontiguous, or (2) the first and second subsequences are 

10 contiguous and the first and second segments are inverted 

11 relative to the complement of the first and second 

12 subsequences in the reference sequence; 

13 second, third and fourth probes, identical to a sequence 

14 comprising the first probe or a subsequence thereof comprising 

15 at least three nucleotides from each of the first and second 

16 segments, except in the at least one interrogation position, 

17 which differs in each of the probes. 

1 48. The array of claim 47, wherein the first and second 

2 subsequences are separated by one or two nucleotides in the 

3 reference sequence. 
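
A bridging probe set of the kind recited in claims 47 and 48 can be sketched for the noncontiguous case, option (1) of claim 47. The sketch assumes equal-length segments, a gap of one or two skipped reference nucleotides, and probes written as the base-by-base complement of the reference in the same order; the function name `bridge_probes` and its parameters are hypothetical, and the inverted-segment case, option (2), is not covered.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq):
    """Base-by-base complement, written in the same order as the input."""
    return "".join(COMPLEMENT[base] for base in seq)


def bridge_probes(reference, start, seg_len=3, gap=1, interrogation=1):
    """Four probes spanning two reference subsequences separated by a gap.

    The probes are identical except at the interrogation position
    (a 0-based index within the joined probe).
    """
    first = reference[start:start + seg_len]
    second_start = start + seg_len + gap
    second = reference[second_start:second_start + seg_len]
    if len(first) < seg_len or len(second) < seg_len:
        raise ValueError("segments run past the end of the reference")
    joined = complement(first + second)  # the gap nucleotides are skipped
    return [
        joined[:interrogation] + base + joined[interrogation + 1:]
        for base in "ACGT"
    ]


if __name__ == "__main__":
    print(bridge_probes("ACGTACGTAC", start=0, seg_len=3, gap=1))
```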

Two interrogation positions (no wildtype) 

1 49. An array of oligonucleotide probes immobilized on a 

2 solid support, the array comprising at least a set of four 

3 probes, each of the probes comprising a segment of at least 7 

4 nucleotides that is exactly complementary to a subsequence 

5 from a reference sequence, except that the segment may or may 

6 not be exactly complementary at two interrogation positions, 

7 wherein: 

8 the first interrogation position is occupied by a 

9 different nucleotide in each of the four probes, 

10 the second interrogation position is occupied by a 

11 different nucleotide in each of the four probes, 

12 in first and second probes, the segment is exactly 

13 complementary to the subsequence, except at not more than one 

14 of the interrogation positions, and 

15 in third and fourth probes, the segment is exactly 

16 complementary to the subsequence, except at both of the 

17 interrogation positions. 

1 50. An array of probes immobilized to a support, the 

2 array comprising at least 100 sets of 4 probes, each set as 

3 defined by claim 49, the probes from the at least 100 sets 

4 comprising at least 100 respective segments, the segments 

5 having at least 100 respective first and second interrogation 

6 positions. 

Helper mutations 

1 51. An array of oligonucleotide probes immobilized on a 

2 solid support, the array comprising a set of probes 

3 comprising: 

4 a first probe comprising a segment of at least 7 

5 nucleotides exactly complementary to a subsequence of a 

6 reference sequence except at one or two positions, the segment 

7 including an interrogation position not at the one or two 

8 positions; 

9 second, third and fourth mutant probes, each identical to 

10 a sequence comprising the wildtype probe or a subsequence 

11 thereof including the interrogation position and the one or 

12 two positions, except in the interrogation position, which is 

13 occupied by a different nucleotide in each of the four probes. 

Omission of Perfectly Matched Probe 

1 52. An array of oligonucleotide probes immobilized on a 

2 solid support, the array comprising at least two sets of 

3 oligonucleotide probes, 

4 (1) a first probe set comprising a plurality of 

5 probes, each probe comprising a segment exactly complementary 

6 to a subsequence of at least 3 nucleotides of a reference 

7 sequence except at an interrogation position, 

8 (2) a second probe set comprising a corresponding 

9 probe for each probe in the first probe set, the corresponding 

10 probe in the second probe set being identical to a sequence 

11 comprising the corresponding probe from the first probe set or 

12 a subsequence of at least three nucleotides thereof that 

13 includes the interrogation position, except that the 

14 interrogation position is occupied by a different nucleotide 

15 in each of the two corresponding probes and the complement to 

16 the reference sequence, 

17 wherein the probes in the first probe set have at 

18 least three interrogation positions respectively corresponding 

19 to each of three contiguous nucleotides in the reference 

20 sequence. 

Methods 

1 53. A method of comparing a target nucleic acid with a 

2 reference sequence comprising a predetermined sequence of 

3 nucleotides, the method comprising: 

4 (a) hybridizing the target nucleic acid to an array 

5 of oligonucleotide probes immobilized on a solid support, the 

6 array comprising: 

7 (1) a first probe set comprising a plurality of 

8 probes, each probe comprising a segment of at least three 

9 nucleotides exactly complementary to a subsequence of the 

10 reference sequence, the segment including at least one 

11 interrogation position complementary to a corresponding 

12 nucleotide in the reference sequence, 

13 (2) a second probe set comprising a corresponding 

14 probe for each probe in the first probe set, the corresponding 

15 probe in the second probe set being identical to a sequence 

16 comprising the corresponding probe from the first probe set or 

17 a subsequence of at least three nucleotides thereof that 

18 includes the at least one interrogation position, except that 

19 the at least one interrogation position is occupied by a 

20 different nucleotide in each of the two corresponding probes 

21 from the first and second probe sets; 

22 wherein, the probes in the first probe set have at 

23 least three interrogation positions respectively corresponding 

to each of at least three nucleotides in the reference 

sequence, and 

(b) determining which probes, relative to one 
another, in the array bind specifically to the target nucleic 
acid, the relative specific binding of the probes indicating 
whether the target sequence is the same or different from the 
reference sequence. 

54. The method of claim 53, wherein the array further 
comprises third and fourth probe sets, each comprising a 
corresponding probe for each probe in the first probe set, the 
probes in the second, third and fourth probe sets being 
identical to a sequence comprising the corresponding probe 
from the first probe set or a subsequence of at least three 
nucleotides thereof that includes the at least one 
interrogation position, except that the at least one 
interrogation position is occupied by a different nucleotide 
in each of the four corresponding probes from the four probe 
sets. 

55. The method of claim 54, wherein the target sequence 

2 has a substituted nucleotide relative to the reference 

3 sequence in at least one undetermined position, and the 

4 relative specific binding of the probes indicates the location 

5 of the position and the nucleotide occupying the position in 

6 the target sequence. 
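
The determining step of claims 53 to 55 can be illustrated with a simple base-calling sketch. It assumes that each interrogated reference position has four measured probe intensities keyed by the nucleotide at the probe's interrogation position, that the target base is the complement of the brightest probe's base, and that a call is ambiguous when the brightest probe is less than `min_ratio` times the runner-up. The threshold, function names and data layout are hypothetical and do not come from the claims.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_bases(reference, columns, min_ratio=1.2):
    """Call the target base at each interrogated reference position.

    `columns` is a list of (ref_pos, {probe_base: intensity}) pairs.
    Returns (ref_pos, reference_base, called_base) tuples, with "N"
    marking an ambiguous call.
    """
    calls = []
    for ref_pos, intensities in columns:
        ranked = sorted(intensities, key=intensities.get, reverse=True)
        best, runner_up = ranked[0], ranked[1]
        if intensities[best] < min_ratio * intensities[runner_up]:
            called = "N"  # signal too close to call
        else:
            called = COMPLEMENT[best]
        calls.append((ref_pos, reference[ref_pos], called))
    return calls


def substitutions(calls):
    """Positions where an unambiguous call differs from the reference."""
    return [call for call in calls if call[2] not in ("N", call[1])]


if __name__ == "__main__":
    columns = [
        (1, {"A": 5, "C": 10, "G": 100, "T": 8}),
        (2, {"A": 90, "C": 12, "G": 7, "T": 9}),
    ]
    print(substitutions(call_bases("ACGT", columns)))
```

Each reported substitution gives both the location of the variant position and the nucleotide occupying it in the target, which is the information recited in claim 55.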

1 56. The method of claim 54, wherein: 

2 the hybridizing step comprises hybridizing the 

3 target nucleic acid and a second target nucleic acid to the 

4 array; and 

5 the determining step comprises determining which 

6 probes, relative to one another, in the array bind 

7 specifically to the target nucleic acid or the second target 

8 nucleic acid, the relative specific binding of the probes 

9 indicating whether the target sequence is the same or 

10 different from the reference sequence and whether the second 

11 target sequence is the same or different from the reference 

12 sequence. 

1 57. The method of claim 56, wherein the target sequence 

2 has a label and the second target sequence has a second label 

3 different from the label. 

1 58. The method of claim 56, wherein undetermined first 

2 and second proportions of the first and second target 

3 sequences are hybridized to the array and the specific binding 

4 indicates the proportions. 

1 59. The method of claim 54, further comprising: 

2 (c) removing the target nucleic acid from the array; 

3 (d) hybridizing a second target nucleic acid to the 

4 array; 

5 (e) determining which probes, relative to one another, in 

6 the array bind specifically to the second target nucleic acid, 

7 the relative specific binding of the probes indicating whether 

8 the second target sequence is the same or different from the 

9 reference sequence. 

1 60. A method of comparing a target nucleic acid with a 

2 reference sequence comprising a predetermined sequence of 

3 nucleotides, the method comprising: 

4 hybridizing the target sequence to the array of 

5 claim 28; 

6 determining which probes in the first group, 

7 relative to one another, hybridize to the target sequence, the 

8 relative specific binding of the probes indicating whether the 

9 target sequence is the same or different from the first 

10 reference sequence; 

11 determining which probes in the second group, 

12 relative to one another, hybridize to the target sequence, the 

13 relative specific binding of the probes indicating whether the 

14 target sequence is the same or different from the second 

15 reference sequence. 

1 61. The method of claim 60, wherein the hybridizing step 

2 comprises hybridizing the target sequence and a second target 

3 sequence to the array, and the relative specific binding of 

4 the probes from the first group indicates that the target is 

5 identical to the first reference sequence, and the relative 

6 specific binding of the probes from the second group indicates 

7 that the second target sequence is identical to the second 

8 reference sequence. 

1 62. The method of claim 61, wherein the first and second 

2 target sequences are heterozygous alleles of a gene. 

Comparative hybridization 

1 63. A method of comparing a target nucleic acid with a 

2 reference sequence comprising a predetermined sequence of 

3 nucleotides, the method comprising: 

4 (a) hybridizing the reference sequence to an array 

5 of oligonucleotide probes immobilized on a solid support, the 

6 array comprising: 

7 (1) a first probe set comprising a plurality of 

8 probes, each probe comprising a segment of at least 3 

9 nucleotides exactly complementary to a subsequence of the 

10 reference sequence except in at least one interrogation 

11 position; 

12 (2) a second probe set comprising a corresponding 

13 probe for each probe in the first probe set, the corresponding 

14 probe in the second probe set being identical to a sequence 

15 comprising the corresponding probe from the first probe set or 

16 a subsequence of at least three nucleotides thereof that 

17 includes the at least one interrogation position, except that 

18 the at least one interrogation position is occupied by a 

19 different nucleotide in each of the two corresponding probes 

20 from the first and second probe sets; and 

21 (b) determining which probes, relative to one 

22 another, in the array bind specifically to the reference 

23 sequence; 

24 (c) hybridizing a target sequence to the array; 
25 (d) determining which probes, relative to one

26 another, in the array bind specifically to the target 

27 sequence; 

28 wherein the relative specific binding of the probes 

29 to the reference and the target sequence indicates whether the 

30 reference sequence is the same or different from the target 

31 sequence.

1 64. The method of claim 63, wherein the reference

2 sequence has a first label and the second reference sequence 

3 has a second label different from the first label, and steps 

4 (a) and (c) are performed simultaneously. 

HIV Chip 

1 65. The array of claim 2, wherein the reference sequence 

2 is from a human immunodeficiency virus. 

1 66. The array of claim 65, wherein the reference 

2 sequence is from a reverse transcriptase gene of the human 

3 immunodeficiency virus. 

1 67. The array of claim 66, wherein the reference 

2 sequence is from a protease gene of the human immunodeficiency 

3 virus.

1 68. The array of claim 66, wherein the reference

2 sequence is a full-length reverse transcriptase gene. 

1 69. The array of claim 68 comprising at least 3200 

2 oligonucleotide probes. 

1 70. The array of claim 66, wherein the HIV gene is from 

2 the BRU HIV strain. 

1 71. The array of claim 66, wherein the HIV gene is from 

2 the SF2 HIV strain. 
1 72. The array of claim 28, wherein the reference

2 sequence is from the coding strand of a reverse transcriptase 

3 gene of a human immunodeficiency virus and the second 

4 reference sequence is from the noncoding strand of the reverse 

5 transcriptase gene. 

1 73. The array of claim 28, wherein the first reference 

2 sequence is from a reverse transcriptase gene of a human 

3 immunodeficiency virus and the second reference sequence 

4 comprises a subsequence of the first reference sequence with a 

5 substitution of at least one nucleotide. 



1 74. The array of claim 73, wherein the substitution 

2 confers drug resistance to a human immunodeficiency virus 

3 comprising the second reference sequence. 

1 75. The array of claim 28, wherein the first and second 

2 reference sequences are from a reverse transcriptase gene from 

3 first and second strains of a human immunodeficiency virus. 

1 76. The array of claim 28, wherein the first reference 

2 sequence is from a reverse transcriptase gene of a human 

3 immunodeficiency virus and the second reference sequence is 

4 from a 16S RNA, or DNA encoding the 16S RNA, from a pathogenic 

5 microorganism.

1 77. The array of claim 28, wherein the first reference 

2 sequence is from a reverse transcriptase gene of a human 

3 immunodeficiency virus and the second reference sequence is 

4 from a protease gene of the human immunodeficiency virus. 

1 78. The method of claim 54, wherein the reference

2 sequence is from a human immunodeficiency virus. 

1 79. The method of claim 78, wherein the reference 

2 sequence is from a human immunodeficiency virus and the target 

3 sequence is from a second human immunodeficiency virus. 
1 80. The method of claim 79, wherein the target sequence

2 has a substituted nucleotide relative to the reference 

3 sequence in at least one undetermined position, and the 

4 relative specific binding of the probes indicates the location 

5 of the position and the nucleotide occupying the position in 

6 the target sequence.

1 81. The method of claim 80, wherein the target sequence 

2 has a substituted nucleotide relative to the reference 

3 sequence in at least one position, the substitution conferring 

4 drug resistance to the human immunodeficiency virus, and the 

5 relative specific binding of the probes reveals the 

6 substitution. 



1 82. The method of claim 78, wherein: 

2 the hybridizing step comprises hybridizing the 

3 target nucleic acid and a second target nucleic acid, the 

4 second target sequence being from a reverse transcriptase gene 

5 of a third human immunodeficiency virus, to the array; and 

6 the determining step comprises determining which 

7 probes, relative to one another, in the array bind 

8 specifically to the target nucleic acid or the second target 

9 nucleic acid, the relative specific binding of the probes 

10 indicating whether the target sequence is the same or 

11 different from the reference sequence and whether the second 

12 target sequence is the same or different from the reference 

13 sequence. 

1 83. The method of claim 82, wherein the first target 

2 sequence has a first label and the second target sequence has 

3 a second label different from the first label. 

1 84. The method of claim 82, wherein undetermined first 

2 and second proportions of the first and second target 

3 sequences are hybridized to the array and the specific binding 

4 indicates the proportions. 
CFTR Chip

1 85. The array of claim 2, wherein the reference sequence

2 is from a CFTR gene. 

1 86. The array of claim 85, wherein the reference 

2 sequence is exon 10 of a CFTR gene, and said array comprises 

3 over 1000 oligonucleotide probes, 10 to 18 nucleotides in 

4 length.

1 87. The array of claim 85, wherein said array comprises 

2 a set of probes comprising a specific nucleotide sequence 

3 selected from the group of sequences comprising: 

4 3'-TTTATAXTAG;

5 3'-TTATAGXAGA;

6 3'-TATAGTXGAA;

7 3'-ATAGTAXAAA;

8 3'-TAGTAGXAAC;

9 3'-AGTAGAXACC;

10 3'-GTAGAAXCCA;

11 3'-TAGAAAXCAC; and

12 3'-AGAAACXACA; wherein each set comprises 4 probes,

13 and X is individually A, G, C, and T for each set.

1 88. The array of claim 85, wherein said group of 

2 sequences comprises: 

3 3'-TTTATAXTAGAAACC;

4 3'-TTATAGXAGAAACCA;

5 3'-TATAGTXGAAACCAC;

6 3'-ATAGTAXAAACCACA;

7 3'-TAGTAGXAACCACAA;

8 3'-AGTAGAXACCACAAA;

9 3'-GTAGAAXCCACAAAG;

10 3'-TAGAAAXCACAAAGG; and

11 3'-AGAAACXACAAAGGA; wherein each set comprises 4

12 probes, and X is individually A, G, C, and T for each set.

1 89. The array of claim 32, wherein the forty first 

2 reference sequences are from a CFTR gene. 
1 90. The array of claim 89, wherein each of the forty

2 first reference sequences includes a site of a mutation and at 

3 least one adjacent nucleotide. 



1 91. The array of claim 90, wherein each of the forty 

2 first reference sequences comprises at least five contiguous 

3 nucleotides from a CFTR gene. 



1 92. The array of claim 89, wherein at least one first 

2 reference sequence is from the coding strand of the cystic

3 fibrosis gene and at least one first reference sequence is 

4 from the noncoding strand of the CFTR gene. 



1 93. An array of oligonucleotide probes immobilized on a 

2 solid support, the array comprising at least a group of probes 

3 comprising: 

4 a wildtype probe exactly complementary to a subsequence 

5 of a reference sequence from a cystic fibrosis gene, the 

6 segment having at least five interrogation positions 

7 corresponding to five contiguous nucleotides in the reference 

8 sequence, 

9 a first set of three mutant probes, each identical to the 

10 wildtype probe, except in a first of the five interrogation 

11 positions, which is occupied by a different nucleotide in each 

12 of the three mutant probes and the wildtype probe; 

13 a second set of three mutant probes, each identical to 

14 the wildtype probe, except in a second of the five 

15 interrogation positions, which is occupied by a different 

16 nucleotide in each of the three mutant probes and the wildtype 

17 probe; 

18 a third set of three mutant probes, each identical to the 

19 wildtype probe, except in a third of the five interrogation

20 positions, which is occupied by a different nucleotide in each 

21 of the three mutant probes and the wildtype probe; 

22 a fourth set of three mutant probes, each identical to 

23 the wildtype probe, except in a fourth of the five 

24 interrogation positions, which is occupied by a different 
25 nucleotide in each of the three mutant probes and the wildtype

26 probe; 

27 a fifth set of three mutant probes, each identical to the 

28 wildtype probe, except in a fifth of the five interrogation 

29 positions, which is occupied by a different nucleotide in each 

30 of the three mutant probes and the wildtype probe. 

1 94. The array of claim 93 comprising first and second

2 groups of probes, each group as defined by claim 93, the first 

3 group comprising a wildtype probe exactly complementary to a 

4 first reference sequence, and the second group comprising a 

5 wildtype probe exactly complementary to a second reference 

6 sequence, wherein the second reference sequence is a mutated 

7 form of the first reference sequence. 

1 95. The array of claim 94, wherein the first reference 

2 sequence is from a CFTR gene and the second reference sequence 

3 is a mutated form of the first reference sequence. 

1 96. The method of claim 56, wherein the target sequence 

2 and the second target sequence are from heterozygous alleles 

3 of a CFTR gene.

P53 Chip 

1 97. The array of claim 2, wherein the reference sequence 

2 is a sequence from a p53 gene. 

1 98. The array of claim 2, wherein the reference sequence 

2 is from an hMLH1 gene.

1 99. The array of claim 2, wherein the reference sequence

2 is from an MSH2 gene.

1 100. The array of claim 28, wherein the reference 

2 sequence is from a human P53 gene and the second reference 

3 sequence is from an hMLH1 gene.



1 101. The array of claim 100, further comprising:

2 ninth, tenth, eleventh and twelfth probe sets,

3 (1) the ninth probe set comprising a plurality of

4 probes, each probe comprising a segment of at least three 

5 nucleotides exactly complementary to a subsequence of a third 

6 reference sequence, the segment including at least one 

7 interrogation position complementary to a corresponding 

8 nucleotide in the third reference sequence, 

9 (2) the tenth, eleventh and twelfth probe sets, 

10 each comprising a corresponding probe for each probe in the 

11 ninth probe set, the probes in the tenth, eleventh and twelfth 

12 probe sets being identical to a sequence comprising the 

13 corresponding probe from the ninth probe set or a subsequence 

14 of at least three nucleotides thereof that includes the at 

15 least one interrogation position, except that the at least one 

16 interrogation position is occupied by a different nucleotide 
17 in each of the four corresponding probes from the ninth,

18 tenth, eleventh and twelfth probe sets. 

1 102. The array of claim 97, wherein the first probe set 

2 has at least 60 interrogation positions corresponding to at least 60

3 contiguous nucleotides from exon 6. 

1 103. The array of claim 98, wherein the reference 

2 sequence is exon 5 of a p53 gene, the probes are 17 

3 nucleotides long, and the first set of probes is exactly 

4 complementary to the reference sequence, and the at least one 

5 interrogation position is at position 7, relative to a 3'-end

6 of each probe, which 3'-end is covalently attached to the

7 substrate.

Mitochondrial Chip 

1 104. The array of claim 2, wherein the reference

2 sequence is from a mitochondrial genome. 

1 105. The array of claim 104, wherein said reference 

2 sequence is a sequence of a D-loop region. 
1 106. The array of claim 105, wherein D-loop region is

2 full-length.

1 107. The array of claim 104, wherein said reference

2 sequence is at least 90% of a full-length mitochondrial

3 genome.

1 108. The array of claim 104, wherein the reference

2 sequence is bounded by positions 16280 to 356 of the

3 mitochondrial genome.
FIG. 5: Tiled Array with Probes for the Detection
of Point Mutations

3'-CCGACTACAGTCGTT
3'-CCGACTCCAGTCGTT
3'-CCGACTGCAGTCGTT
3'-CCGACTTCAGTCGTT



WO 95/11995

PCT/US94/12305

Fig. 6 




Fig. 7 




Fig. 8 




Fig. 10 
Page 1 of 2 




Fig. 10 
Page 2 of 2 




Fig. 11 




Figure 12 
(Page 1 of 2) 




Figure 12 
(Page 2 of 2) 




Fig. 13 




Array Design for the R553X Point Mutation

Wild-Type Pattern

[Figure: hybridization grid. Columns: positions n-2, n-1, n, n+1 and n+2, each with a wild-type (w) and a mutant (m) lane. Rows: bases A, C, G, T. Probe families: ...cNaCgag..., ...cNaTgag..., ...caNCgag..., ...caNTgag..., ...caaNgag..., ...caaCNag..., ...caaTNag..., ...caaCgNg..., ...caaTgNg... Legend: exact complement; single base-pair mismatch.]

Wild-Type Sequence: 5'-AGGTCAACGAGCAA-3'
Mutant Sequence: 5'-AGGTCAATGAGCAA-3'


Fig. 16 
Array Design for the R553X Point Mutation

Heterozygote Pattern

[Figure: hybridization grid. Columns: positions n-2, n-1, n, n+1 and n+2, each with a wild-type (w) and a mutant (m) lane. Rows: bases A, C, G, T. Probe families: ...cNaCgag..., ...cNaTgag..., ...caNCgag..., ...caNTgag..., ...caaNgag..., ...caaCNag..., ...caaTNag..., ...caaTgNg...]

Wild-Type Sequence: 5'-AGGTCAACGAGCAA-3'
Mutant Sequence: 5'-AGGTCAATGAGCAA-3'


Fig. 17 
Group II is directed to the use of arrays of pooled probes, whereas Group III is
directed only to a pool of probes, which may have many other uses and also contains limitations to specific
positions in probes in the pool which are not recited in Group II. Thus, Groups II and III lack unity of invention in not
containing the same special technical feature for probes in a pool or pools therein. Groups IV-VII are all directed to
arrays similar to that cited in Group I but are directed to completely different specific reference genes. Therefore
Groups IV-VII lack unity of invention with Groups II and III for the same reasons as discussed above regarding Group
I. Additionally, each of Groups IV-VII cites a completely different and unrelated specific reference gene, and that
gene is deemed the special technical feature of the Group when
each of Groups I and IV-VII is compared to any other Group therein. In summary, the claims are not so linked by a
special technical feature within the meaning of PCT Rule 13.2 so as to form a single inventive concept.
(57) Abstract

Methods are provided for detecting and quantitating gene sequences, such as mutated genes and oncogenes, in biological
fluids. The fluid sample (e.g., plasma, serum, urine, etc.) is obtained and deproteinized, and the DNA present in the sample is
extracted. Following denaturation of the DNA, an amplification procedure, such as PCR or LCR, is conducted to amplify the mutated
gene sequence.
DETECTION OF GENE SEQUENCES IN BIOLOGICAL FLUIDS



Government Support 

The research leading to this invention was 
supported by government funding pursuant to NIH Grant 
No. CA 47248. 

Background of the Invention 

Soluble DNA is known to exist in the blood
of healthy individuals at concentrations of about
5 to 10 ng/ml. It is believed that soluble DNA is
present in increased levels in the blood of
individuals having autoimmune diseases, particularly
systemic lupus erythematosus (SLE), and other diseases
including viral hepatitis, cancer and pulmonary
embolism. It is not known whether circulating
soluble DNA represents a specific type of DNA which
is particularly prone to appear in the blood.
However, studies indicate that the DNA behaves as
double-stranded DNA or as a mixture of
double-stranded and single-stranded DNA, and that it
is likely to be composed of native DNA with
single-stranded regions. Dennin, R.H., Klin.
Wochenschr. 57:451-456 (1979); Steinman, C.R.,
J. Clin. Invest. 73:832-841 (1984); Fournie, G.J. et
al., Analytical Biochem. 158:250-256 (1986). There
is also evidence that in patients with SLE, the
circulating DNA is enriched for human repetitive
sequence (Alu) containing fragments when compared to
normal human genomic DNA.
In patients with cancer, the levels of
circulating soluble DNA in blood are significantly
increased. Types of cancers which appear to have a
high incidence of elevated DNA levels include
pancreatic carcinoma, breast carcinoma, colorectal
carcinoma and pulmonary carcinoma. In these forms of
cancer, the levels of circulating soluble DNA in
blood are usually over 50 ng/ml, and generally the
mean values are more than 150 ng/ml. Leon et al.,
Can. Res. 37:646-650, 1977; Shapiro et al., Cancer
51:2116-2120, 1983.

Mutated oncogenes have been described in
experimental and human tumors. In some instances
certain mutated oncogenes are associated with
particular types of tumors. Examples of these are
adenocarcinomas of the pancreas, colon and lung, which
have approximately a 75%, 50%, and 35% incidence,
respectively, of Kirsten ras (K-ras) genes with
mutations in position 1 or 2 of codon 12. The most
frequent mutations are changes from glycine to valine
(GGT to GTT), glycine to cysteine (GGT to TGT), and
glycine to aspartic acid (GGT to GAT). Other, but
less common, mutations of codon 12 include mutations
to AGT and CGT. K-ras genes in somatic cells of such
patients are not mutated.
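
The codon 12 substitutions listed above can be checked against the standard genetic code. The following Python sketch is illustrative only: the table is a partial excerpt of the standard code, and the names CODON_TABLE and describe_mutation are hypothetical helpers, not part of the described method.

```python
# Illustrative only: partial standard genetic code covering the codon 12
# variants named in the text (coding-strand DNA codons).
CODON_TABLE = {
    "GGT": "Gly",  # wild-type codon 12
    "GTT": "Val",
    "TGT": "Cys",
    "GAT": "Asp",
    "AGT": "Ser",
    "CGT": "Arg",
}

WILD_TYPE = "GGT"


def describe_mutation(mutant: str) -> str:
    """Describe a single-base change from the wild-type codon."""
    # 1-based positions at which the mutant codon differs from wild type.
    diffs = [i + 1 for i, (a, b) in enumerate(zip(WILD_TYPE, mutant)) if a != b]
    if len(diffs) != 1:
        raise ValueError("expected exactly one substituted base")
    return (
        f"{WILD_TYPE}->{mutant} "
        f"({CODON_TABLE[WILD_TYPE]}->{CODON_TABLE[mutant]}), "
        f"position {diffs[0]}"
    )


if __name__ == "__main__":
    for codon in ("GTT", "TGT", "GAT", "AGT", "CGT"):
        print(describe_mutation(codon))
```

Each of the five variants differs from GGT at position 1 or position 2 only, consistent with the statement above.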

The ability to detect sequences of mutated 
oncogenes or other genes in small samples of 
biological fluid, such as blood plasma, would provide 
a useful diagnostic tool. The presence of mutated 
K-ras gene sequences in the plasma would be 
indicative of the presence in the patient of a tumor 
which contains mutated oncogenes. Presumably this
would be a specific tumor marker since there is no
other known source of mutated K-ras genes.
Therefore, this evaluation may be useful in 
suggesting and/or confirming a diagnosis. The amount 
of mutated K-ras sequences in the plasma may relate 
to the size of the tumor, the growth rate of the 
tumor and/or the regression of the tumor. Therefore, 
serial quantitation of mutated K-ras sequences may be 
useful in determining changes in tumor mass. Since 
most human cancers have mutated oncogenes, evaluation 
of plasma DNA for mutated sequences may have very 
wide applicability and usefulness. 

Summary of the Invention

This invention recognizes that gene 
sequences (e.g., oncogene sequences) exist in blood, 
and provides a method for detecting and quantitating 
gene sequences such as from mutated oncogenes and 
other genes in biological fluids, such as blood 
plasma and serum. The method can be used as a 
diagnostic technique to detect certain cancers and 
other diseases which tend to increase levels of 
circulating soluble DNA in blood. Moreover, this
method is useful in assessing the progress of 
treatment regimes for patients with certain cancers. 

The method of the invention involves the
initial steps of obtaining a sample of biological
fluid (e.g., urine, blood plasma or serum, sputum,
cerebral spinal fluid), then deproteinizing and
extracting the DNA. The DNA is then amplified by
techniques such as the polymerase chain reaction
(PCR) or the ligase chain reaction (LCR) in an allele
specific manner to distinguish a normal gene sequence
from a mutated gene sequence present in the sample.
In one embodiment where the location of the mutation
is known, the allele specific PCR amplification is
performed using four pairs of oligonucleotide
primers. The four primer pairs include a set of four
allele specific first primers complementary to the
gene sequence contiguous with the site of the
mutation on the first strand. These four primers are
unique with respect to each other and differ only at
the 3' nucleotide, which is complementary to the wild
type nucleotide or to one of the three possible
mutations which can occur at this known position.
The four primer pairs also include a single common
primer which is used in combination with each of the
four unique first strand primers. The common primer
is complementary to a segment of a second strand of
the DNA, at some distance from the position of the
first primer.
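
The construction of the four allele specific first primers can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the template sequence, the primer length of 12 and the function names are hypothetical and chosen only for the example; the text does not specify primer lengths or sequences here.

```python
# Sketch: build four allele-specific primers that differ only at the
# 3'-terminal nucleotide. All sequences and lengths are hypothetical.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (5'->3')."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def allele_specific_primers(first_strand: str, site: int, length: int = 12) -> dict:
    """Map each possible template base at `site` to its matching primer.

    `first_strand` is written 5'->3' and `site` is the 0-based index of
    the mutation. Each primer is complementary to the strand segment
    that starts at the mutation site, so the primer's 3'-terminal
    nucleotide pairs with the base at `site`.
    """
    segment = first_strand[site:site + length]
    if len(segment) < length:
        raise ValueError("template too short for requested primer length")
    primers = {}
    for base in "ACGT":
        variant = base + segment[1:]  # substitute only the mutation site
        primers[base] = reverse_complement(variant)
    return primers


if __name__ == "__main__":
    template = "TTGACGGTAGCCATGCAAGTCCTGA"  # hypothetical first strand
    for base, primer in allele_specific_primers(template, site=6).items():
        print(f"template base {base}: 5'-{primer}-3'")
```

In use, each of the four primers would be run in its own reaction together with the single common primer, so that a product appears only in the reaction whose 3' nucleotide matches the base present in the sample.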

This amplification procedure amplifies a 
known base pair fragment which includes the 
mutation. Accordingly, this technique has the 
advantage of displaying a high level of sensitivity 
since it is able to detect only a few mutated DNA 
sequences in a background of a lo'^-fold excess of 
normal DNA. The method is believed to be of much
greater sensitivity than methods which detect point
mutations by hybridization of a PCR product with
allele specific radiolabelled probes, which will not
detect a mutation if the normal DNA is in more than
20-fold excess.
The above embodiment is useful where a
mutation exists at a known location on the DNA. In
another embodiment where the mutation is known to
exist in one of two possible positions, eight pairs of
oligonucleotide primers may be used. The first set
of four primer pairs (i.e., the four unique, allele
specific primers, each of which forms a pair with a
common primer) is as described above. The second set
of four primer pairs comprises four allele specific
primers complementary to the gene sequence contiguous
with the site of the second possible mutation on the
sense strand. These four primers are unique with
respect to each other and differ at the terminal 3'
nucleotide, which is complementary to the wild type
nucleotide or to one of the three possible mutations
which can occur at this second known position. Each
of these allele specific primers is paired with
another common primer complementary to the other
strand, distant from the location of the mutation.

The PCR techniques described above
preferably utilize a DNA polymerase which lacks
3' exonuclease activity and therefore the ability to
proofread. A preferred DNA polymerase is Thermus
aquaticus DNA polymerase.

During the amplification procedure, it is
usually sufficient to conduct approximately 30 cycles
of amplification in a DNA thermal cycler. After an
initial denaturation period of 5 minutes, each
amplification cycle preferably includes a
denaturation period of about 1 minute at 95°C,
primer annealing for about 2 minutes at 58°C and an
extension at 72°C for approximately 1 minute.

Following the amplification, aliquots of
amplified DNA from the PCR can be analyzed by
techniques such as electrophoresis through agarose
gel using ethidium bromide staining. Improved
sensitivity may be attained by using labelled primers
and subsequently identifying the amplified product by
detecting radioactivity or chemiluminescence on
film. Labelled primers may also permit quantitation
of the amplified product, which may be used to
determine the amount of target sequence in the
original specimen.
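
As a rough aid to planning, the cycling conditions given above (a 5 minute initial denaturation followed by about 30 cycles of 1, 2 and 1 minutes) can be written down as data and totalled. The sketch below is an assumption-laden illustration: the class and function names are hypothetical, the temperature of the initial denaturation is assumed to be 95°C (the text gives only its duration), and ramp times between temperatures are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    temp_c: float
    minutes: float


# Temperature of the initial denaturation is an assumption; the text
# states only that it lasts 5 minutes.
INITIAL_DENATURATION = Step("initial denaturation", 95.0, 5.0)

CYCLE = (
    Step("denaturation", 95.0, 1.0),
    Step("annealing", 58.0, 2.0),
    Step("extension", 72.0, 1.0),
)


def hold_time_minutes(cycles: int = 30) -> float:
    """Total programmed hold time, ignoring ramping between temperatures."""
    if cycles < 1:
        raise ValueError("cycles must be at least 1")
    per_cycle = sum(step.minutes for step in CYCLE)
    return INITIAL_DENATURATION.minutes + cycles * per_cycle


print(hold_time_minutes(30))  # 125.0
```

Actual run time on a thermal cycler will be longer, since heating and cooling between steps are not counted here.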

As used herein, allele specific 
amplification describes a feature of the method of 
the invention where primers are used which are 
specific to a mutant allele, thus enabling 
amplification of the sequence to occur where there is 
100% complementarity between the 3' end of the primer
and the target gene sequence. Thus, allele specific 
amplification is advantageous in that it does not 
permit amplification unless there is a mutated 
allele. This provides an extremely sensitive 
detection technique. 

Brief Description Of The Drawings 

Figures 1A and 1B are diagrammatic
representations of the amplification strategy for the
detection of a mutated K-ras gene with a mutation
present at a single known location of K-ras.

Figures 2A and 2B are diagrammatic
representations of the amplification strategy for
detection of a mutated K-ras gene with a mutation
present at a second of two possible locations of
K-ras.

Detailed Description Of The Invention

The detection of mutated DNA, such as 
specific single copy genes, is potentially useful for 
diagnostic purposes, and/or for evaluating the extent 
of a disease. Normal plasma is believed to contain 
about 10 ng of soluble DNA per ml. The concentration 
of soluble DNA in blood plasma is known to increase 
markedly in individuals with cancer and some other 
diseases. The ability to detect the presence of 
known mutated gene sequences, such as K-ras gene 
sequences, which are indicative of a medical 
condition, is thus highly desirable. 

The present invention provides a highly 
sensitive diagnostic method enabling the detection of 
such mutant alleles in biological fluid, even against 
a background of as much as a lO'^-fold excess of 
normal DNA. The method generally involves the steps 
of obtaining a sample of a biological fluid 
containing soluble DNA, deproteinizing, extracting 
and denaturing the DNA, followed by amplifying the 
DNA in an allele specific manner, using a set of 
primers among which is a primer specific for the 
mutated allele. Through this allele specific 
amplification technique, only the mutant allele is 
amplified. Following amplification, various
techniques may be employed to detect the presence of
amplified DNA and to quantify the amplified DNA. The
presence of the amplified DNA represents the presence
of the mutated gene, and the amount of the amplified
gene present can provide an indication of the extent
of a disease.

This technique is applicable to the
identification in biological fluid of sequences from
single copy genes mutated at a known position on the
gene. Samples of biological fluid having soluble DNA
(e.g., blood plasma, serum, urine, sputum, cerebral
spinal fluid) are collected and treated to
deproteinize and extract the DNA. Thereafter, the
DNA is denatured. The DNA is then amplified in an
allele specific manner so as to amplify the gene
bearing a mutation.

During deproteinization of DNA from the 
fluid sample, the rapid removal of protein and the
virtual simultaneous deactivation of any DNase is 
believed to be important. DNA is deproteinized by 
adding to aliquots of the sample an equal volume of 
20% NaCl and then boiling the mixture for about 3 to 
4 minutes. Subsequently, standard techniques can be 
used to complete the extraction and isolation of the 
DNA. A preferred extraction process involves 
concentrating the amount of DNA in the fluid sample 
by techniques such as centrifugation. 

The use of the 20% NaCl solution, followed
by boiling, is believed to rapidly remove protein and
simultaneously inactivate any DNases present. DNA
present in the plasma is believed to be in the form
of nucleosomes and is thus believed to be protected
from the DNases while in blood. However, once the
DNA is extracted, it is susceptible to the DNases.
Thus, it is important to inactivate the DNases at the 
same time as deproteinization to prevent the DNases 
from inhibiting the amplification process by reducing 
the amount of DNA available for amplification. 
Although the 20% NaCl solution is currently 
preferred, it is understood that other concentrations 
of NaCl, and other salts, may also be used. 

Other techniques may also be used to extract 
the DNA while preventing the DNases from affecting 
the available DNA. Because plasma DNA is believed to 
be in the form of nucleosomes (mainly histones and 
DNA), plasma DNA could also be isolated using an 
antibody to histones or other nucleosomal proteins. 
Another approach could be to pass the plasma (or 
serum) over a solid support with attached antihistone 
antibodies which would bind with the nucleosomes. 
After rinsing, the nucleosomes can be eluted from the
antibodies as an enriched or purified fraction. 
Subsequently, DNA can be extracted using the above or 
other conventional methods. 

In one embodiment, the allele specific 
amplification is performed through the Polymerase 
Chain Reaction (PCR) using primers having 3' terminal
nucleotides complementary to specific point mutations
of a gene for which detection is sought. PCR
preferably is conducted by the method described by
Saiki, "Amplification of Genomic DNA", PCR Protocols,
Eds. M.A. Innis, et al., Academic Press, San Diego
(1990), pp. 13. In addition, the PCR is conducted
using a thermostable DNA polymerase which lacks 3'
exonuclease activity and therefore the ability to
repair single base mismatches at the 3' terminal
nucleotide of the DNA primer during amplification.
As noted, a preferred DNA polymerase is T. aquaticus
DNA polymerase. A suitable T. aquaticus DNA
polymerase is commercially available from
Perkin-Elmer as AmpliTaq DNA polymerase. Other
useful DNA polymerases which lack 3' exonuclease
activity include Vent (exo-), available from New
England Biolabs, Inc. (purified from strains of E.
coli that carry a DNA polymerase gene from the
archaebacterium Thermococcus litoralis), Hot Tub DNA
polymerase derived from Thermus flavus and available
from Amersham Corporation, and Tth DNA polymerase
derived from Thermus thermophilus, available from
Epicentre Technologies, Molecular Biology Resource
Inc., or Perkin-Elmer Corp.

This method conducts the amplification using
four pairs of oligonucleotide primers. A first set of
four primers comprises four allele specific primers
which are unique with respect to each other. The
four allele specific primers are each paired with a
common distant primer which anneals to the other DNA
strand distant from the allele specific primer. One
of the allele specific primers is complementary to
the wild type allele (i.e., is allele specific to the
normal allele) while the others have a mismatch at
the 3' terminal nucleotide of the primer. As noted,
the four unique primers are individually paired for
amplification (e.g., by PCR amplification) with a
common distant primer. When the mutated allele is
present, the primer pair including the allele
specific primer will amplify efficiently and yield a
detectable product. While the mismatched primers may
anneal, the strand will not be extended during
amplification.

The above primer combination is useful where 
a mutation is known to exist at a single position on 
an allele of interest. Where the mutation may exist 
at one of two locations, eight pairs of
oligonucleotide primers may be used. The first set
of four pairs is as described above. The second four
pairs of primers comprise four allele specific
oligonucleotide primers complementary to the gene
sequence contiguous with the site of the second
possible mutation on the sense strand. These four
primers differ at the terminal 3' nucleotide, which is
complementary to the wild type nucleotide or to one 
of the three possible mutations which can occur at 
this second known position. Each of the four allele 
specific primers is paired with a single common 
distant primer which is complementary to the 
antisense strand upstream of the mutation. 

During a PCR amplification using the above
primers, only the primer which is fully complementary 
to the allele which is present will anneal and 
extend. The primers having a non-complementary 
nucleotide may partially anneal, but will not extend 
during the amplification process. Amplification 
generally is allowed to proceed for a suitable number
of cycles, i.e., from about 20 to 40, and most
preferably for about 30. This technique amplifies a
mutation-containing fragment of the target gene with 
sufficient sensitivity to enable detection of the 
mutated target gene against a significant background 
of normal DNA. 
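
To give a feel for why 20 to 40 cycles are sufficient, the theoretical yield of an exponential amplification can be computed directly. The sketch below is an idealised model and not a measurement: it assumes a constant per-cycle efficiency (1.0 meaning perfect doubling), which real reactions do not sustain, and the function name is hypothetical.

```python
def ideal_fold_amplification(cycles: int, efficiency: float = 1.0) -> float:
    """Fold increase after `cycles` rounds, each multiplying by (1 + efficiency)."""
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    return (1.0 + efficiency) ** cycles


for n in (20, 30, 40):
    print(n, f"{ideal_fold_amplification(n):.3e}")
```

With perfect doubling, 30 cycles correspond to roughly a 10^9-fold increase; lower efficiencies reduce this figure steeply.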

The K-ras gene has point mutations which
usually occur at one or two known positions in a
known codon. Other oncogenes may have mutations at
known but variable locations. Mutations within the
K-ras gene are typically known to be associated with
certain cancers such as adenocarcinomas of the lung,
pancreas, and colon. Figures 1A through 2B
illustrate a strategy for detecting, through PCR
amplification, a mutation occurring at position 1 or
2 of the 12th codon of the K-ras oncogene. As
previously noted, mutations at the first or second
position of the 12th codon of K-ras are often
associated with certain cancers such as
adenocarcinomas of the lung, pancreas, and colon.

Referring to Figures 1A and 1B, the DNA from
the patient sample is separated into two strands (A
and B), which represent the sense and antisense
strands. The DNA represents an oncogene having a
point mutation which occurs on the same codon (i.e.,
codon 12) at position 1 (X1). The allele-specific
primers used to detect the mutation at position 1
include a set of four P1 sense primers (P1-A), each
of which is unique with respect to the others. The
four P1-A primers are complementary to a gene
sequence contiguous with the site of the mutation on
strand A. The four P1-A primers preferably differ
from each other only at the terminal 3' nucleotide,
which is complementary to the wild type nucleotide or
to one of the three possible mutations which can
occur at this known position. Only the P1-A primer
which is fully complementary to the
mutation-containing segment on the allele will anneal
and extend during amplification.
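
The selection rule just described, in which only the fully matched primer extends, can be mimicked with simple string matching. The following is a minimal sketch under stated assumptions: extension is treated as all-or-nothing, the primer is written in the sense orientation so that a perfect match means the primer sequence occurs in the sense strand, and the short flanking sequences are illustrative placeholders rather than data from the text. All names are hypothetical.

```python
STEM = "GTGGTAGTTGGAGCT"             # shared 5' portion of the position 1 primers (Table 1)
UPSTREAM = "AACTT"                   # illustrative flank, an assumption
DOWNSTREAM = "GTGGCGTAGGCAAGAGTGC"   # illustrative flank, an assumption

# One primer per possible base at position 1 of codon 12.
P1_A_PRIMERS = {base: STEM + base for base in "GCTA"}


def sense_fragment(first_base_of_codon_12: str) -> str:
    """Build a toy sense-strand fragment carrying the given base at position 1."""
    return UPSTREAM + STEM + first_base_of_codon_12 + DOWNSTREAM


def primer_extends(primer: str, sense_strand: str) -> bool:
    """Idealised rule: extension occurs only on a perfect match, 3' base included."""
    return primer in sense_strand


def extending_primers(sense_strand: str) -> list[str]:
    """Return the 3' bases of the primers that would extend on this template."""
    return [
        base
        for base, primer in P1_A_PRIMERS.items()
        if primer_extends(primer, sense_strand)
    ]


print(extending_primers(sense_fragment("G")))  # wild type template
print(extending_primers(sense_fragment("T")))  # template with a G to T change
```

In practice mismatched primers can show some residual extension, so this model should be read as the ideal behaviour the method aims for.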

A common downstream primer (P1-B),
complementary to a segment of the B strand downstream
with respect to the position of the P1-A primers, is
used in combination with each of the P1-A primers.
The P1-B primer illustrated in Figure 1 anneals to
the allele and is extended during the PCR. Together,
the P1-A and P1-B primers identified in Table 1 and
illustrated in Figure 1B amplify a fragment of the
oncogene having 161 base pairs.

Figures 2A and 2B illustrate a scheme
utilizing an additional set of four unique, allele
specific primers (P2-A) to detect a mutation which
can occur at codon 12 of the oncogene, at position 2
(X2). The amplification strategy illustrated in
Figures 1A and 1B would be used in combination with
that illustrated in Figures 2A and 2B to detect
mutations at either position 1 (X1) or position
2 (X2) in codon 12.

Referring to Figures 2A and 2B, a set of
four unique allele specific primers (P2-A) is used
to detect a mutation present at position 2 (X2) of
codon 12. The four P2-A primers are complementary to
the genetic sequence contiguous with the site of the
second possible mutation. These four primers are
unique with respect to each other and preferably
differ only at the terminal 3' nucleotide, which is
complementary to the wild type nucleotide or to one
of the three possible mutations which can occur at
the second known position (X2).

A single common upstream primer (P2-B),
complementary to a segment of the A strand upstream 
of the mutation, is used in combination with each of 
the unique P2-A primers. The P2-A and P2-B primers 
identified in Table 1 and illustrated in Figure 2B 
will amplify a fragment having 146 base pairs. 

During the amplification procedure, the 
polymerase chain reaction is allowed to proceed for 
about 20 to 40 cycles and most preferably for 30 
cycles. Following an initial denaturation period of 
about 5 minutes, each cycle, using the AmpliTaq DNA 
polymerase, typically includes about one minute of 
denaturation at 95°C, two minutes of primer
annealing at about 58°C, and a one minute extension
at 72°C. While the temperatures and cycle times
noted above are currently preferred, it is noted that
various modifications may be made. Indeed, the use
of different DNA polymerases and/or different primers
may necessitate changes in the amplification
conditions. One skilled in the art will readily be
able to optimize the amplification conditions.

Exemplary DNA primers which are useful in
practicing the method of this invention to detect the
K-ras gene, having point mutations at either the
first or second position in codon 12 of the gene, are
illustrated in Table 1.

TABLE 1

Primers Used to Amplify (by PCR) Position 1
and 2 Mutations at Codon 12 of K-ras Gene
(5'-3')

Sequence*              Strand    P1 or P2

GTGGTAGTTGGAGCTG         A          P1
GTGGTAGTTGGAGCTC         A          P1
GTGGTAGTTGGAGCTT         A          P1
GTGGTAGTTGGAGCTA         A          P1
CAGAGAAACCTTTATCTG       B          P1

ACTCTTGCCTACGCCAC        A          P2
ACTCTTGCCTACGCCAG        A          P2
ACTCTTGCCTACGCCAT        A          P2
ACTCTTGCCTACGCCAA        A          P2
GTACTGGTGGAGTATTT        B          P2

*Underlined bases denote mutations.

The primers illustrated in Table 1 are, of
course, merely exemplary. Various modifications can
be made to these primers as is understood by those
having ordinary skill in the art. For example, the
primers could be lengthened or shortened; however, the
3' terminal nucleotides must remain the same. In
addition, some mismatches 3 to 6 nucleotides back
from the 3' end may be made and would not be likely
to interfere with efficacy. The common primers can
also be constructed differently so as to be
complementary to a different site, yielding either a
longer or shorter amplified product.

In one embodiment, the length of each allele
specific primer can be different, making it possible
to combine multiple allele specific primers with
their common distant primer in the same PCR
reaction. The length of the amplified product would
be indicative of which allele specific primer was
being utilized in the amplification. The length of
the amplified product would thus indicate which mutation
was present in the specimen.
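
One way to picture this length-coded multiplexing is as a lookup from observed product size to allele. The sketch below is purely illustrative: the 161 base pair figure is the position 1 product size stated in the text, but the extra 5' lengths assigned to each primer and the sizing tolerance are hypothetical values chosen for the example.

```python
BASE_PRODUCT_BP = 161  # position 1 product size stated in the text

# Hypothetical extra nucleotides added to the 5' end of each allele
# specific primer, keyed by the primer's 3' terminal base.
EXTRA_5PRIME_NT = {"G (wild type)": 0, "C": 10, "T": 20, "A": 30}


def expected_sizes() -> dict[str, int]:
    """Expected product length for each allele specific primer."""
    return {allele: BASE_PRODUCT_BP + extra for allele, extra in EXTRA_5PRIME_NT.items()}


def call_allele(observed_bp: int, tolerance_bp: int = 3):
    """Return the allele whose expected size matches the band, or None if ambiguous."""
    matches = [
        allele
        for allele, size in expected_sizes().items()
        if abs(size - observed_bp) <= tolerance_bp
    ]
    return matches[0] if len(matches) == 1 else None


print(expected_sizes())
print(call_allele(181))  # T
```

The length differences need to exceed the resolution of the gel used, which is why the example spaces the products 10 base pairs apart.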

The primers illustrated in Table 1 and
Figures 1B and 2B, and others which could be used,
can be readily synthesized by one having ordinary
skill in the art. For example, the preparation of
similar primers has been described by Stork et al.,
Oncogene 6:857-862, 1991.

Other amplification methods and strategies
may also be utilized to detect gene sequences in
biological fluids according to the method of the
invention. For example, another approach would be to
combine PCR and the ligase chain reaction (LCR).
Since PCR amplifies faster than LCR and requires
fewer copies of target DNA to initiate, one could use
PCR as a first step and then proceed to LCR. Primers
such as the common primers used in the allele
specific amplification described previously, which
span a sequence of approximately 285 base pairs in
length, more or less centered on codon 12 of K-ras,
could be used to amplify this fragment, using
standard PCR conditions. The amplified product
(approximately a 285 base pair sequence) could then
be used in an LCR or ligase detection reaction (LDR)
in an allele specific manner which would indicate if
a mutation was present. Another, perhaps less
sensitive, approach would be to use LCR or LDR for
both amplification and allele specific
discrimination. The latter reaction is advantageous
in that it results in linear amplification. Thus the
amount of amplified product is a reflection of the
amount of target DNA in the original specimen and
therefore permits quantitation.

LCR utilizes pairs of adjacent
oligonucleotides which are complementary to the
entire length of the target sequence (Barany F., PNAS
88: 189-193, 1991; Barany F., PCR Methods and
Applications 1: 5-16, 1991). If the target sequence
is perfectly complementary to the primers at the
junction of these sequences, a DNA ligase will link
the adjacent 3' and 5' terminal nucleotides, forming a
combined sequence. If a thermostable DNA ligase is
used with thermal cycling, the combined sequence will
be sequentially amplified. A single base mismatch at
the junction of the oligonucleotides will preclude
ligation and amplification. Thus, the process is
allele specific. Another set of oligonucleotides
with 3' nucleotides specific for the mutant would be
used in another reaction to identify the mutant
allele. A series of standard conditions could be
used to detect all possible mutations at any known
site. LCR typically utilizes both strands of genomic
DNA as targets for oligonucleotide hybridization with
four primers, and the product is increased
exponentially by repeated thermal cycling.
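
The junction rule that makes LCR allele specific can be sketched as a string check: two adjacent oligonucleotides are joined only when their concatenation matches the target. This is a simplified model under stated assumptions. The target is written in the same orientation as the oligonucleotides (standing in for the strand they hybridise to), any mismatch is treated as blocking ligation, the oligonucleotide strings are borrowed from Table 2 only as example sequences, and the short flanks on the target are arbitrary. The function name is hypothetical.

```python
def ligation_product(upstream: str, downstream: str, target: str):
    """Return the joined oligonucleotide if both match the target contiguously, else None."""
    joined = upstream + downstream
    return joined if joined in target else None


DOWNSTREAM = "GGTGGCGTAGGCAAGAGTGC"

# Four upstream oligonucleotides differing only at the 3' terminal base.
UPSTREAM_VARIANTS = {
    "T": "AACTTGTGGTAGTTGGAGCT",
    "A": "AACTTGTGGTAGTTGGAGCA",
    "C": "AACTTGTGGTAGTTGGAGCC",
    "G": "AACTTGTGGTAGTTGGAGCG",
}

# Toy target carrying T at the junction; the flanking bases are arbitrary.
target = "TATAA" + UPSTREAM_VARIANTS["T"] + DOWNSTREAM + "C"

ligated = [
    base
    for base, oligo in UPSTREAM_VARIANTS.items()
    if ligation_product(oligo, DOWNSTREAM, target) is not None
]
print(ligated)  # ['T']
```

Only the oligonucleotide whose 3' base matches the target at the junction yields a ligation product, which is the basis of the allele discrimination described above.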

A variation of the reaction is the ligase 
detection reaction (LDR) which utilizes two adjacent 
oligonucleotides which are complementary to the 
target DNA and are similarly joined by DNA ligase 
(Barany F., PNAS 88:189-193, 1991). After multiple 
thermal cycles the product is amplified in a linear 
fashion. Thus the amount of the product of LDR 
reflects the amount of target DNA. Appropriate
labeling of the primers allows detection of the
amplified product in an allele specific manner, as
well as quantitation of the amount of original target
DNA. One advantage of this type of reaction is that
it allows quantitation through automation (Nickerson
et al., PNAS 87: 8923-8927, 1990).
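
A toy comparison may help show why linear amplification lends itself to quantitation. In the sketch below, exponential amplification is capped at a plateau to represent reagent exhaustion, whereas linear amplification simply adds one product per target per cycle. The plateau value is a hypothetical number chosen for illustration, the function names are hypothetical, and neither function is a kinetic model of a real reaction.

```python
PLATEAU_COPIES = 1e10  # hypothetical ceiling set by reagent exhaustion


def exponential_product(target_copies: float, cycles: int,
                        plateau: float = PLATEAU_COPIES) -> float:
    """Idealised doubling per cycle, capped at a plateau."""
    return min(target_copies * 2.0 ** cycles, plateau)


def linear_product(target_copies: float, cycles: int) -> float:
    """One ligation product per target copy per cycle."""
    return target_copies * cycles


for copies in (100, 1000):
    print(copies, exponential_product(copies, 30), linear_product(copies, 30))
```

In this toy model, after 30 cycles the exponential reaction has reached the same plateau for both inputs, while the linear products still differ by the same factor of ten as the starting amounts.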

Examples of suitable oligonucleotides for
use with LCR for allele specific ligation and
amplification to identify mutations at position 1 in
codon 12 of the K-ras gene are illustrated below in
Table 2.

TABLE 2

Oligonucleotides (5'-3') for use in LCR

Sequence*                Oligonucleotide   Strand

AGCTCCAACTACCACAACaTT    A1                A
GCACTCTTGCCTACGCCACC     A2-A              A
GCACTCTTGCCTACGCCACA     A2-B              A
GCACTCTTGCCTACGCCACG     A2-C              A
GCACTCTTGCCTACGCCACT     A2-D              A
GGTGGCGTAGGCAAGAGTGC     B1                B
AACTTGTGGTAGTTGGAGCT     B2-A              B
AACTTGTGGTAGTTGGAGCA     B2-B              B
AACTTGTGGTAGTTGGAGCC     B2-C              B
AACTTGTGGTAGTTGGAGCG     B2-D              B

*Underlined bases denote mutations.



During an amplification procedure involving
LCR, four oligonucleotides are used at a time. For
example, oligonucleotide A1 and, separately, each of
the A2 oligonucleotides are paired on the sense
strand. Also, oligonucleotide B1 and, separately,
each of the B2 oligonucleotides are paired on the
antisense strand. For an LDR procedure, two
oligonucleotides are paired, i.e., A1 with each of
the A2 oligonucleotides, for linear amplification of
the normal and mutated target DNA sequence.

The method of the invention is applicable to
the detection and quantitation of other oncogenes in
DNA present in various biological fluids. The p53
gene is a gene for which convenient detection and
quantitation could be useful because alterations in
this gene are the most common genetic anomaly in
human cancer, occurring in cancers of many histologic
types arising from many anatomic sites. Mutations of
p53 may occur at multiple codons within the gene
but 80% are localized within 4 conserved regions, or
"hot spots", in exons 5, 6, 7 and 8. The most
popular current method for identifying the mutations
in p53 is a multistep procedure. It involves PCR
amplification of exons 5-8 from genomic DNA,
individually, in combination (i.e., multiplexing), or
sometimes as units of more than one exon. An
alternative approach is to isolate total cellular
RNA, which is transcribed with reverse
transcriptase. A portion of the reaction mixture is
subjected directly to PCR to amplify the regions of
p53 cDNA using a pair of appropriate oligonucleotides
as primers. These two types of amplification are
followed by single strand conformation polymorphism
analysis (SSCP), which will distinguish amplified samples
with point mutations from normal DNA by differences
in mobility when electrophoresed in polyacrylamide
gel. If a fragment is shown by SSCP to contain a
mutation, the latter is amplified by asymmetric PCR
and the sequence determined by the dideoxy-chain
termination method (Murakami et al., Can. Res., 51:
3356-3361, 1991).

Further, the ligase chain reaction (LCR) may
be useful with p53 since LCR is better able to 
evaluate multiple mutations at the same time. After 
determining the mutation, allele specific primers can 
be prepared for subsequent quantitation of the 
mutated gene in the patient's plasma at multiple 
times during the clinical course. 

Preferably, the method of the invention is 
conducted using biological fluid samples of 
approximately 5ml. However, the method can also be 
practiced using smaller sample sizes in the event 
that specimen supply is limited. In such case, it 
may be advantageous to first amplify the DNA present 
in the sample using the common primers. Thereafter, 
amplification can proceed using the allele specific 
primers.
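
For a sense of scale, the amount of DNA in a 5 ml sample can be converted into approximate genome equivalents. The sketch below uses the figure of about 10 ng of soluble DNA per ml of normal plasma quoted earlier in the text; the mass of one haploid human genome (about 3.3 pg) is an outside assumption, not a value from this document, and the function name is hypothetical.

```python
PLASMA_DNA_NG_PER_ML = 10.0  # figure quoted in the text for normal plasma
HAPLOID_GENOME_PG = 3.3      # assumption: approximate mass of one haploid human genome


def genome_equivalents(volume_ml: float,
                       ng_per_ml: float = PLASMA_DNA_NG_PER_ML,
                       genome_pg: float = HAPLOID_GENOME_PG) -> float:
    """Approximate number of haploid genome copies in a plasma sample."""
    if volume_ml < 0 or ng_per_ml < 0 or genome_pg <= 0:
        raise ValueError("inputs must be non-negative, genome_pg positive")
    total_pg = volume_ml * ng_per_ml * 1000.0  # 1 ng = 1000 pg
    return total_pg / genome_pg


print(round(genome_equivalents(5.0)))
```

Under these assumptions a 5 ml sample holds on the order of 15,000 genome copies, which indicates how few mutant copies may be present when the mutant allele is rare.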

The method of this invention may be embodied 
in diagnostic kits. Such kits may include reagents 
for the isolation of DNA as well as sets of primers 
used in the detection method, and reagents useful in 
the amplification. Among the reagents useful for the 
kit is a DNA polymerase used to effect the 
amplification. A preferred polymerase is Thermus
aquaticus DNA polymerase available from Perkin-Elmer
as AmpliTaq DNA polymerase. For quantitation of the 
mutated gene sequences, the kit can also contain 
samples of mutated DNA for positive controls as well 
as tubes for quantitation by competitive PCR having 
the engineered sequence in known amounts.

The quantitation of the mutated K-ras
sequences may be achieved using either slot blot
Southern hybridization or competitive PCR. Slot blot
Southern hybridization can be performed utilizing
the allele specific primers as probes under
relatively stringent conditions as described by
Verlaan-de Vries et al. (Gene 50:313-20, 1986). The
total DNA extracted from 5 ml of plasma will be slot
blotted with 10 fold serial dilutions, followed by
hybridization to an end-labeled allele specific probe
selected to be complementary to the known mutation in
the particular patient's tumor DNA as determined
previously by screening with the battery of allele
specific primers and PCR and LCR. Positive
autoradiographic signals will be graded
semiquantitatively by densitometry after comparison
with a standard series of diluted DNA (1-500 ng) from
tumor cell cultures which have the identical mutation
in codon 12 of K-ras, prepared as slot blots in
the same way.

A modified competitive PCR (Gilliland et
al., Proc. Natl. Acad. Sci. USA 87:2725-29, 1990;
Gilliland et al., "Competitive PCR for Quantitation
of mRNA", PCR Protocols (Acad. Press), pp. 60-69,
1990) could serve as a potentially more sensitive
alternative to the slot blot Southern hybridization
quantitation method. In this method of quantitation,
the same pair of primers is utilized to amplify two
DNA templates which compete with each other during
the amplification process. One template is the
sequence of interest in unknown amount, i.e., mutated
K-ras, and the other is an engineered deletion mutant
in known amount which, when amplified, yields a
shorter product which can be distinguished from the
amplified mutated K-ras sequence. Total DNA
extracted from the plasma as described above will be
quantitated utilizing slot blot Southern
hybridization, utilizing a radiolabelled human
repetitive sequence probe (BLUR8). This will allow a
quantitation of total extracted plasma DNA so that
the same amount can be used in each of the PCR
reactions. DNA from each patient (100 ng) will be
added to a PCR master mixture containing P1 or P2
allele specific primers corresponding to the
particular mutation previously identified for each
patient in a total volume of 400 µl. Forty µl of
master mixture containing 10 ng of plasma DNA will be
added to each of 10 tubes containing 10 µl of
competitive template ranging from 0.1 to 10
attomoles. Each reaction mixture will contain dNTPs
(25 µM final concentration including [α-32P]dGTP at
50 µCi/ml), 50 pmoles of each primer, 2 mM MgCl2, 2
units of T. aquaticus DNA polymerase, 1 x PCR buffer,
50 µg/ml BSA, and water to a final volume of 40 µl.
Thirty cycles of PCR will be followed by
electrophoresis of the amplified products. Bands
identified by ethidium bromide will be excised, counted
and a ratio of K-ras sequence to deletion mutant
sequence calculated. To correct for difference in
molecular weight, cpm obtained for genomic K-ras
bands will be multiplied by 141/161 or 126/146,
depending upon whether position 1 (P1) or position 2
(P2) primers are used. (The exact ratio will depend
upon the length of the deletion mutant.) Data will
be plotted as log ratio of deletion template
DNA/K-ras DNA vs. log input deletion template DNA
(Gilliland et al. 1990a, 1990b).

A modified competitive PCR could also be
developed in which one primer has a modified 5' end
which carries a biotin moiety and the other primer
has a 5' end with a fluorescent chromophore. The
amplified product can then be separated from the
reaction mixture by adsorption to avidin or
streptavidin attached to a solid support. The amount
of product formed in the PCR can be quantitated by
measuring the amount of fluorescent primer
incorporated into double-stranded DNA by denaturing
the immobilized DNA by alkali and thus eluting the
fluorescent single strands from the solid support and
measuring the fluorescence (Landgraf et al., Anal.
Biochem. 182: 231-235, 1991).

The competitive template preferably 
comprises engineered deletion mutants with a sequence 
comparable to the fragments of the wild type K-ras 
and the mutated K-ras gene amplified by the P1 and P2
series of primers described previously, except there
will be an internal deletion of approximately 20
nucleotides. Therefore, the amplified products will
be smaller, i.e., about 140 base pairs and 125 base
pairs when the P1 primers and P2 primers are used,
respectively. Thus, the same primers can be used and 
yet amplified products from the engineered mutants 
can be readily distinguished from the amplified 
genomic sequences. 
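
The competitive PCR arithmetic described above (length correction of the counts, then a plot of log ratio against log input competitor) can be sketched as follows. This is a minimal illustration, not the authors' analysis: the function name is hypothetical, the counts are synthetic values constructed so that the answer is known in advance, and the equivalence point is taken as the competitor input at which the fitted log ratio equals zero, i.e. where competitor and target are present in equal molar amounts.

```python
import math


def equivalence_point(competitor_amol, competitor_cpm, target_cpm,
                      competitor_bp=141, target_bp=161):
    """Estimate target amount (attomoles) from a competitive PCR titration.

    Target counts are scaled by competitor_bp / target_bp, as in the text,
    so that the ratio reflects moles rather than incorporated label.
    """
    xs, ys = [], []
    for c, d, t in zip(competitor_amol, competitor_cpm, target_cpm):
        corrected_target = t * competitor_bp / target_bp
        xs.append(math.log10(c))
        ys.append(math.log10(d / corrected_target))
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    if slope == 0:
        raise ValueError("ratio does not vary with competitor input")
    intercept = mean_y - slope * mean_x
    # log ratio = 0 where competitor and target are equimolar.
    return 10 ** (-intercept / slope)


# Synthetic titration: 1.0 attomole of target in every tube, counts
# proportional to moles times product length (scale factor 10 is arbitrary).
competitor = [0.1, 0.3, 1.0, 3.0, 10.0]
comp_cpm = [141 * 10 * c for c in competitor]
targ_cpm = [161 * 10 * 1.0 for _ in competitor]

estimate = equivalence_point(competitor, comp_cpm, targ_cpm)
print(f"{estimate:.3f} attomoles")
```

For position 2 primers the same function would be called with `competitor_bp=126` and `target_bp=146`, matching the 126/146 factor given in the text.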

Eight deletion mutants will be produced
using the polymerase chain reaction (Higuchi et al.,
Nucleic Acids Res. 16:7351-67, 1988; Vallette et al.,
Nucleic Acids Res. 17:723-33, 1989; Higuchi,
PCR Technology, Ch. 6, pp. 61-70 (Stockton Press,
1989)). The starting material will be normal genomic
DNA representing the wild-type K-ras or tumor DNA
from tumors which are known to have each of the
possible point mutations in position one and two of
codon 12. The wild-type codon 12 is GGT. The
following tumor DNA can be used:

First position codon 12 mutations

G→A     A549
G→T*    Calu1, PR371
G→C     A2182, A1698

Second position codon 12 mutations

G→A*    Aspc1
G→T*    SW480
G→C     818-1, 181-4, 818-7

(*G→T transversions in the first or second position
account for approximately 80% of the point mutations
found in pulmonary carcinoma, and GAT (aspartic acid)
or GTT (valine) are most common in pancreatic
cancer.)

The deletion mutants with an approximately
20 residue deletion will be derived as previously
described (Vallette et al. 1989). In summary, the P1
and P2 primers will be used in an allele specific
manner with the normal DNA or with DNA from the tumor
cell line with each specific mutation. Each of these
will be paired for amplification with a common
primer which contains the sequence of the common
primer normally used with either the P1 or P2 allele
specific primers, i.e., "P1-B" or "P2-B", at the 5'
end with an attached series of residues representing
sequences starting approximately 20 bases downstream,
thus spanning the deleted area (common deletion
primers 1 and 2, CD1 and CD2). The precise location
and therefore sequence of the 3' portion of the
primer will be determined after analysis of the
sequence of the ras gene in this region with OLIGO
(NBI, Plymouth, MN), a computer program which
facilitates the selection of optimal primers. The
exact length of the resultant amplified product is
not critical, so the best possible primer which will
produce a deletion of 20-25 residues will be
selected. For example, with P2 primers the allele
specific primer for the wild-type will be
5' ACTCTTGCCTACGCCAC 3', complementary to residues 35
to 51 in the coding sequence. To effect a deletion
of approximately 20 residues in the complementary
strand, the common upstream primer to be used with
the wild-type and the three allele specific primers
for mutations in position two of codon 12 will be 40
residues long (CD2), complementary to residues -95 to
-78 (the currently preferred common upstream primer
for use with P2 allele specific primers and residues
at approximately -58 to -25). The amplified shorter
product will be size-separated by gel electrophoresis
and purified by Prep-A-Gene (Bio-Rad). DNA
concentrations will be determined by ethidium
bromide staining with comparison to dilutions of DNA
of known concentration. This approach will be
repeated eight times: four times using the four P1
primers and common primer (CD1) constructed as above,
and four times with the four P2 primers and common
primer (CD2). These deletion mutants will be
amplified using the same allele specific primers used
to amplify the genomic DNA. Therefore, they can be
used subsequently in known serial dilutions in a
competitive PCR, as outlined above.
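
The construction of a deletion-spanning common primer described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure actually used: the template is a made-up 80-base string (not the ras sequence), all coordinates are on a single strand for simplicity, and the 18 + 22 split of the 40-residue primer is chosen only because residues -95 to -78 span 18 positions. The function names are hypothetical.

```python
def build_deletion_primer(template: str, anchor_start: int, anchor_len: int,
                          deletion_len: int, tail_len: int) -> str:
    """Join the ordinary common-primer sequence (5' part) to a tail that
    begins `deletion_len` bases further downstream (3' part)."""
    anchor_end = anchor_start + anchor_len
    tail_start = anchor_end + deletion_len
    tail_end = tail_start + tail_len
    if tail_end > len(template):
        raise ValueError("tail runs past the end of the template")
    return template[anchor_start:anchor_end] + template[tail_start:tail_end]


def deleted_product(template: str, anchor_start: int, anchor_len: int,
                    deletion_len: int, product_end: int) -> str:
    """Shortened product: template from the primer's 5' end up to
    `product_end`, with the skipped bases removed."""
    anchor_end = anchor_start + anchor_len
    return (template[anchor_start:anchor_end]
            + template[anchor_end + deletion_len:product_end])


# Hypothetical 80-base template, used only to make the coordinates concrete.
template = "".join("ACGT"[(i * 7 + i // 4) % 4] for i in range(80))

primer = build_deletion_primer(template, anchor_start=5, anchor_len=18,
                               deletion_len=20, tail_len=22)
product = deleted_product(template, anchor_start=5, anchor_len=18,
                          deletion_len=20, product_end=80)

print(len(primer))                          # 40
print(len(template[5:80]) - len(product))   # 20
```

Because the deletion product keeps both primer binding sites, it would be expected to amplify with the same allele specific primer as the genomic target while remaining distinguishable by size on a gel.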

The invention is further illustrated by the
following non-limiting examples.

Example 1 

Blood was collected in 13 x 75 mm vacutainer
tubes containing 0.05 ml of 15% K3EDTA. The tubes
were immediately centrifuged at 4°C for 30 minutes at
1000 g, the plasma was removed and recentrifuged at
4°C for another 30 minutes at 1000 g.
The plasma was stored at -70°C. Next, DNA was
deproteinized by adding an equal volume of 20% NaCl
to 5 ml aliquots of plasma which were then boiled for
3 to 4 minutes. After cooling, the samples were
centrifuged at 3000 rpm for 30 minutes. The
supernatant was removed and dialysed against three
changes of 10 mM Tris-HCl (pH 7.5)/1 mM EDTA (pH 8.0)
("TE") for 18 to 24 hours at 4°C. The DNA was
extracted once with two volumes of phenol, 2x1 volume
phenol:chloroform:isoamyl alcohol (25:24:1) and 2x1
volume chloroform:isoamyl alcohol (24:1). DNA was
subsequently precipitated with NaCl at 0.3 M, 20 µg/ml
glycogen as a carrier and 2.5 volumes of 100% ethanol
at -20°C for 24 hours. DNA was recovered by
centrifugation in an Eppendorf centrifuge at 4°C for
30 minutes. The DNA was then resuspended in TE
buffer. The DNA extracted and prepared in the above
manner could then be amplified.
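
As a worked illustration of the precipitation step, the following Python sketch computes the volumes to add to a given volume of DNA solution. The target values (0.3 M NaCl, 20 µg/ml glycogen, 2.5 volumes of ethanol) are taken from the example; the stock concentrations (5 M NaCl, 20 mg/ml glycogen), the function name, and the choice to apply the 2.5 volumes to the aqueous volume after NaCl addition are assumptions made only for the arithmetic.

```python
def precipitation_volumes(sample_ml, nacl_stock_molar=5.0,
                          glycogen_stock_mg_per_ml=20.0):
    """Volumes to add to `sample_ml` of DNA solution before ethanol precipitation.

    Stock concentrations are assumed values, not taken from the example.
    """
    if nacl_stock_molar <= 0.3:
        raise ValueError("NaCl stock must be more concentrated than 0.3 M")
    # C_stock * V = 0.3 * (sample + V)  ->  V = 0.3 * sample / (C_stock - 0.3)
    nacl_ml = 0.3 * sample_ml / (nacl_stock_molar - 0.3)
    aqueous_ml = sample_ml + nacl_ml
    # 20 ug per ml of aqueous phase; a stock in mg/ml is the same number in ug/ul.
    # The small volume of glycogen stock itself is ignored.
    glycogen_ul = 20.0 * aqueous_ml / glycogen_stock_mg_per_ml
    ethanol_ml = 2.5 * aqueous_ml
    return {"nacl_ml": nacl_ml, "glycogen_ul": glycogen_ul, "ethanol_ml": ethanol_ml}


volumes = precipitation_volumes(1.0)
print({name: round(value, 3) for name, value in volumes.items()})
```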

Example 2

An allele specific amplification of DNA
obtained and prepared according to Example 1 was
conducted by PCR as follows to detect the K-ras gene
in the DNA having a mutation at position 1 or 2 of
codon 12 of the K-ras gene. To each of eight
reaction tubes was added DNA extracted from 0.5 ml of
plasma in a total volume of 40 µl containing 67 mM
Tris-HCl (pH 8.8), 10 mM β-mercaptoethanol, 16.6 µM
ammonium sulfate, 6.7 µM EDTA, 2.0 mM MgCl2, 50 µg/ml
BSA, and 25 µM dNTP. Also, 50 pmoles of each of the
primers identified in Table 1 was included, together
with 3 units of Thermus aquaticus DNA polymerase
(available from Perkin-Elmer as AmpliTaq). PCR was
conducted with an initial denaturation at 95°C for 5
minutes, followed by 30 cycles of PCR amplification
in a DNA thermal cycler (Cetus; Perkin-Elmer Corp.,
Norwalk, Connecticut). Each amplification cycle
includes a 1 minute denaturation at 95°C, a 2 minute
primer annealing period at 58°C, and a 1 minute
extension period at 72°C.
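
The cycling program and primer loading above reduce to simple arithmetic, sketched here in Python. The step values come from the example; the class and function names are illustrative only, and ramp times between temperatures are ignored, so the total is a lower bound on actual run time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    temp_c: int
    minutes: int


INITIAL = Step("initial denaturation", 95, 5)
CYCLE = (
    Step("denaturation", 95, 1),
    Step("annealing", 58, 2),
    Step("extension", 72, 1),
)
N_CYCLES = 30


def total_hold_minutes(initial, cycle, n_cycles):
    """Sum of programmed hold times, excluding temperature ramps."""
    return initial.minutes + n_cycles * sum(step.minutes for step in cycle)


def primer_micromolar(pmol, volume_ul):
    """pmol per microlitre is numerically equal to micromolar."""
    return pmol / volume_ul


print(total_hold_minutes(INITIAL, CYCLE, N_CYCLES))  # 125
print(primer_micromolar(50, 40))                     # 1.25
```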

Following the completion of amplification,
10-15 µl of each of the PCR reaction products is
analyzed by electrophoresis in a 2% agarose gel/1X
TAE-0.5 µg/ml EtBr. The electrophoresis uses an
applied voltage of 100 volts for 90 minutes.
Photographs of the samples are then taken using
ultraviolet light under standard conditions.

It is understood that various modifications 
can be made to the present invention without 
departing from the scope of the claimed invention. 



WO 93/22456    PCT/US93/03561



SEQUENCE LISTING 

(1) GENERAL INFORMATION: 

(i) APPLICANT: Sorenson, George D.

(ii) TITLE OF INVENTION: Detection of Gene Sequences In Biological Fluids

(iii) NUMBER OF SEQUENCES: 20 

(iv) CORRESPONDENCE ADDRESS: 

(A) ADDRESSEE: Lahive & Cockfield 

(B) STREET: 60 State Street 

(C) CITY: Boston 

(D) STATE: Massachusetts 

(E) COUNTRY: U.S.A. 

(F) ZIP: 02109 

(v) COMPUTER READABLE FORM: 

(A) MEDIUM TYPE: Floppy Disk

(B) COMPUTER: IBM PC compatible

(C) OPERATING SYSTEM: PC-DOS/MS-DOS
(D) SOFTWARE: ASCII Text

(vi) CURRENT APPLICATION DATA:
(A) APPLICATION NUMBER:

(B) FILING DATE: 27-APR-1992

(C) CLASSIFICATION 

(viii) ATTORNEY/AGENT INFORMATION: 

(A) NAME: William C. Geary III 

(B) REGISTRATION NUMBER: 31,357 

(C) REFERENCE/DOCKET NUMBER: DCI-037
(ix) TELECOMMUNICATION INFORMATION:

(A) TELEPHONE: (617) 227-7400
(B) TELEFAX: (617) 227-5941
(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 1

(2) INFORMATION FOR SEQ ID NO: 1:

(i) SEQUENCE CHARACTERISTICS: 
(A) LENGTH: 16 base pairs
(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 1:
GTGGTAGTTG GAGCTG 16



(2) INFORMATION FOR SEQ ID NO: 2: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 16 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 2:
GTGGTAGTTG GAGCTC 16



(2) INFORMATION FOR SEQ ID NO: 3: 

(i) SEQUENCE CHARACTERISTICS: 
(A) LENGTH: 16 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 3:
GTGGTAGTTG GAGCTT 16



(2) INFORMATION FOR SEQ ID NO: 4:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 16 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 4:
GTGGTAGTTG GAGCTA 16



(2) INFORMATION FOR SEQ ID NO: 5: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 18 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 5: 
CAGAGAAACC TTTATCTG 



(2) INFORMATION FOR SEQ ID NO: 6: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 17 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 
(ii) MOLECULE TYPE: DNA

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 6:
ACTCTTGCCT ACGCCAC 17



(2) INFORMATION FOR SEQ ID NO: 7:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 17 base pairs

(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 7:
ACTCTTGCCT ACGCCAG 17



(2) INFORMATION FOR SEQ ID NO: 8:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 17 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 8:
ACTCTTGCCT ACGCCAA 17



(2) INFORMATION FOR SEQ ID NO: 9:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 17 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 9:
ACTCTTGCCT ACGCCAT 17



(2) INFORMATION FOR SEQ ID NO: 10:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 17 base pairs
(B) TYPE: nucleic acid

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 10: 
GTACTGGTGG AGTATTT 17 

(2) INFORMATION FOR SEQ ID NO: 11:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 20 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 11: 
AGCTCCAACT ACCACAAGTT 20 



(2) INFORMATION FOR SEQ ID NO: 12: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 12: 
GCACTCTTGC CTACGCCACC 20



(2) INFORMATION FOR SEQ ID NO: 13:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 13: 
GCACTCTTGC CTACGCCACA 20 



(2) INFORMATION FOR SEQ ID NO: 14: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 14: 
GCACTCTTGC CTACGCCACG 20 



(2) INFORMATION FOR SEQ ID NO: 15: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 15:
GCACTCTTGC CTACGCCACT 20



(2) INFORMATION FOR SEQ ID NO: 16:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 16: 
GGTGGCGTAG GCAAGAGTGC 20 



(2) INFORMATION FOR SEQ ID NO: 17: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 17: 
AACTTGTGGT AGTTGGAGCT 20 



(2) INFORMATION FOR SEQ ID NO: 18: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 18:
AACTTGTGGT AGTTGGAGCA 20



(2) INFORMATION FOR SEQ ID NO: 19:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 19:

AACTTGTGGT AGTTGGAGCC 20

(2) INFORMATION FOR SEQ ID NO: 20:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 20 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(iii) SEQUENCE DESCRIPTION: SEQ ID NO: 20:

AACTTGTGGT AGTTGGAGCG 20
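
A listing like the one above is easy to mis-transcribe, so a short consistency check can be useful. The Python sketch below is an illustration only: the sequences and lengths are transcribed from SEQ ID NO: 1 to 20 as given in the LENGTH fields, and the helper names are hypothetical. It checks that each sequence has its declared length, that each set of four allele specific primers differs only at the 3' terminal base, and that SEQ ID NO: 11 and 12 are the reverse complements of SEQ ID NO: 17 and 16.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


# SEQ ID NO -> (declared length, sequence), transcribed from the listing.
LISTING = {
    1: (16, "GTGGTAGTTGGAGCTG"), 2: (16, "GTGGTAGTTGGAGCTC"),
    3: (16, "GTGGTAGTTGGAGCTT"), 4: (16, "GTGGTAGTTGGAGCTA"),
    5: (18, "CAGAGAAACCTTTATCTG"),
    6: (17, "ACTCTTGCCTACGCCAC"), 7: (17, "ACTCTTGCCTACGCCAG"),
    8: (17, "ACTCTTGCCTACGCCAA"), 9: (17, "ACTCTTGCCTACGCCAT"),
    10: (17, "GTACTGGTGGAGTATTT"),
    11: (20, "AGCTCCAACTACCACAAGTT"),
    12: (20, "GCACTCTTGCCTACGCCACC"), 13: (20, "GCACTCTTGCCTACGCCACA"),
    14: (20, "GCACTCTTGCCTACGCCACG"), 15: (20, "GCACTCTTGCCTACGCCACT"),
    16: (20, "GGTGGCGTAGGCAAGAGTGC"),
    17: (20, "AACTTGTGGTAGTTGGAGCT"), 18: (20, "AACTTGTGGTAGTTGGAGCA"),
    19: (20, "AACTTGTGGTAGTTGGAGCC"), 20: (20, "AACTTGTGGTAGTTGGAGCG"),
}


def length_mismatches(listing):
    """SEQ ID numbers whose sequence length differs from the declared length."""
    return [n for n, (declared, seq) in listing.items() if len(seq) != declared]


def differ_only_at_3prime(ids, listing):
    """True if the sequences share everything but a distinct final base."""
    seqs = [listing[i][1] for i in ids]
    same_body = len({s[:-1] for s in seqs}) == 1
    distinct_ends = len({s[-1] for s in seqs}) == len(seqs)
    return same_body and distinct_ends


GROUPS = [(1, 2, 3, 4), (6, 7, 8, 9), (12, 13, 14, 15), (17, 18, 19, 20)]

print(length_mismatches(LISTING))                                 # []
print(all(differ_only_at_3prime(g, LISTING) for g in GROUPS))     # True
print(revcomp(LISTING[16][1]) == LISTING[12][1])                  # True
print(revcomp(LISTING[17][1]) == LISTING[11][1])                  # True
```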

Claims:

1. A method of detecting a mutant allele, 
comprising the steps of: 

providing a sample of a biological 
fluid containing soluble DNA, including a mutant 
allele of interest; 

extracting the DNA from the sample; 

denaturing the DNA to free first and
second strands of the DNA;

amplifying the mutant allele of
interest in an allele specific manner using at least
a first set of four allele specific oligonucleotide
primers having one primer complementary to a
mutation-containing segment on a first strand of the
DNA and a first common primer for pairing during
amplification to each allele specific primer, the
common primer being complementary to a segment of a
second strand of the DNA distant with respect to the
position of the first primer; and

detecting the presence of the mutant 
allele of interest. 

2. The method of claim 1 further 
comprising the step of removing protein from the 
sample and inactivating any DNase within the sample 
before the step of extracting the DNA.

3. The method of claim 2, wherein the
mutant allele is amplified in an allele specific
manner using the polymerase chain reaction (PCR).

4. The method of claim 3, wherein
following the amplification step, the step of
detecting the presence of the mutant allele of
interest comprises performing an allele specific
ligase chain reaction (LCR) or a ligase detection
reaction (LDR) using the amplified product of PCR.

5. The method of claim 2 wherein protein
is removed and DNases are inactivated by adding a 
salt solution to the sample and subsequently boiling 
the sample. 

6. The method of claim 2 wherein the
biological fluid is selected from the group 
consisting of whole blood, serum, plasma, urine, 
sputum, and cerebral spinal fluid. 

7. The method of claim 2 wherein the 
mutant allele comprises a gene sequence having a 
point mutation at a known location. 

8. The method of claim 7 wherein the first
DNA strand is the sense strand and the second DNA
strand is the antisense strand.

9. The method of claim 2 wherein the step
of amplifying the mutant allele with the PCR is
conducted using a DNA polymerase which lacks the 3'
exonuclease activity and therefore the ability to
repair single nucleotide mismatches at the 3' end of
the primer.

10. The method of claim 9 wherein the DNA
polymerase is a Thermus aquaticus DNA polymerase.

11. The method of claim 9 wherein the first 
set of allele specific oligonucleotide primers 
comprises: 

four sense primers, one of which has a
3' terminal nucleotide complementary to a point
mutation of the sense strand, and the remaining three 
of which are complementary to the wild type sequence 
for the segment to be amplified and to sequences 
having the remaining two possible mutations at the 
mutated point of the sense strand; and 

a common antisense primer complementary 
to a segment of the antisense strand distant from the 
location on the sense strand at which the sense 
primers will anneal, the common antisense primer 
being paired with each of the sense primers during 
amplification. 

12. The method of claim 11 wherein the 3' 
terminal nucleotide of the complementary sense primer 
anneals with the mutated nucleotide of the sense 
strand. 

13. The method of claim 3 wherein the 
mutant allele comprises a gene sequence having a 
point mutation at one of two known locations.

14. The method of claim 13 wherein the step
of amplifying the mutant allele through the PCR
further comprises the use of a second set of four
allele specific oligonucleotide primers, in
conjunction with the first set, wherein the second
set of allele specific oligonucleotide primers 
comprises: 

four sense primers, one of which has a
3' terminal nucleotide complementary to a point
mutation of the sense strand, and the remaining three 
of which are complementary to the wild type sequence 
for the segment to be amplified and sequences having 
the remaining two possible mutations at the mutated 
point of the sense strand; and 

a common antisense primer complementary 
to a segment of the antisense strand distant from the 
location on the sense strand at which the sense 
primers will anneal, the common antisense primer 
being paired with each of the sense primers during 
amplification. 

15. The method of claim 14 wherein the 3' 
terminal nucleotide of the complementary sense primer 
anneals with the mutated nucleotide of the sense 
strand. 

16. The method of claim 15 wherein the 
mutant allele to be detected is the K-ras gene 
sequence having a mutation at position 1 or 2 in the 
twelfth codon.

17. The method of claim 16 wherein the
first set of allele specific oligonucleotide primers
comprises sense primers having the following sequences

5' GTGGTAGTTGGAGCTG 3' (wild type)
5' GTGGTAGTTGGAGCTC 3'
5' GTGGTAGTTGGAGCTT 3'
5' GTGGTAGTTGGAGCTA 3'

and the common antisense primer having the following
sequence

5' CAGAGAAACCTTTATCTG 3'.

18. The method of claim 14 wherein the
second set of allele specific oligonucleotide primers
comprises sense primers having the following sequences

5' ACTCTTGCCTACGCCAC 3' (wild type)
5' ACTCTTGCCTACGCCAG 3'
5' ACTCTTGCCTACGCCAT 3'
5' ACTCTTGCCTACGCCAA 3'

and the common antisense primer having the following
sequence

5' GTACTGGTGGAGTATTT 3'.

19. The method of claim 2 wherein the step 
of detecting the presence of amplified DNA is 
conducted by gel electrophoresis in 1-5% agarose gel.

20. The method of claim 2 wherein the
biological fluid is selected from the group
consisting of whole blood, serum, plasma, urine,
sputum, and cerebral spinal fluid. 

21. A diagnostic kit for detecting the 
presence of a mutated K-ras gene sequence in 
biological fluids wherein the mutation is present in 
the twelfth codon at position 1, comprising:

reagents to facilitate the
deproteinization and isolation of DNA;

reagents to facilitate amplification by
PCR;

a heat stable DNA polymerase; and

a first set of allele specific
oligonucleotide sense primers having the following
sequences

5' GTGGTAGTTGGAGCTG 3'
5' GTGGTAGTTGGAGCTC 3'
5' GTGGTAGTTGGAGCTT 3'
5' GTGGTAGTTGGAGCTA 3'

and a first common antisense primer having
the following sequence

5' CAGAGAAACCTTTATCTG 3'

22. The diagnostic kit of claim 21 further
comprising

a second set of allele specific
oligonucleotide sense primers having the following
sequences

5' ACTCTTGCCTACGCCAC 3'
5' ACTCTTGCCTACGCCAG 3'
5' ACTCTTGCCTACGCCAT 3'
5' ACTCTTGCCTACGCCAA 3'

and a second common antisense primer having
the following sequence

5' GTACTGGTGGAGTATTT 3'

wherein the second set of allele specific 
oligonucleotide primers and the second common primer 
are useful in detecting in biological fluid the 
presence of a mutated K-ras gene sequence in the 
twelfth codon at position 2. 
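
The logic of the two primer sets recited in the claims can be illustrated with a small Python model. It is a sketch under explicit assumptions, not the claimed method: the 40-base sense-strand fragment is assumed to be SEQ ID NO: 17 followed directly by SEQ ID NO: 16, with codon 12 (GGT) starting at offset 20; a primer is treated as extendable only when its full length, including the 3' terminal base, matches the template exactly, which is stricter than real annealing; and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


# Assumed sense-strand fragment: SEQ ID NO: 17 joined to SEQ ID NO: 16.
WILD_TYPE = "AACTTGTGGTAGTTGGAGCT" + "GGTGGCGTAGGCAAGAGTGC"
CODON12_START = 20  # 0-based offset of the first base of codon 12 (assumption)

# First set (codon 12, position 1): same sense as WILD_TYPE.
P1_PRIMERS = ["GTGGTAGTTGGAGCTG", "GTGGTAGTTGGAGCTC",
              "GTGGTAGTTGGAGCTT", "GTGGTAGTTGGAGCTA"]
# Second set (codon 12, position 2): written on the opposite strand.
P2_PRIMERS = ["ACTCTTGCCTACGCCAC", "ACTCTTGCCTACGCCAG",
              "ACTCTTGCCTACGCCAT", "ACTCTTGCCTACGCCAA"]


def mutate(sense, codon_position, base):
    """Replace position 1, 2 or 3 of codon 12 on the sense strand."""
    i = CODON12_START + codon_position - 1
    return sense[:i] + base + sense[i + 1:]


def extending_primers(sense):
    """Primers that match the template over their whole length."""
    hits = [p for p in P1_PRIMERS if p in sense]
    hits += [p for p in P2_PRIMERS if revcomp(p) in sense]
    return hits


print(extending_primers(WILD_TYPE))
# ['GTGGTAGTTGGAGCTG', 'ACTCTTGCCTACGCCAC']
print(extending_primers(mutate(WILD_TYPE, 2, "A")))   # GGT -> GAT
# ['GTGGTAGTTGGAGCTG', 'ACTCTTGCCTACGCCAT']
```

In this model exactly one primer from each set matches any given template, which is why running the four reactions of a set in parallel identifies the base present at that position.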
SECRETED PROTEINS

FIELD OF THE INVENTION

The present invention provides novel proteins, along with therapeutic, diagnostic and
research utilities for these proteins.

BACKGROUND OF THE INVENTION

Technology aimed at the discovery of protein factors (including e.g., cytokines, such
as lymphokines, interferons, CSFs and interleukins) has matured rapidly over the past decade.
The now routine hybridization cloning and expression cloning techniques clone novel
polynucleotides "directly" in the sense that they rely on information directly related to the
discovered protein (i.e., partial DNA/amino acid sequence of the protein in the case of
hybridization cloning; activity of the protein in the case of expression cloning). More recent
"indirect" cloning techniques such as signal sequence cloning, which isolates DNA sequences
based on the presence of a now well-recognized secretory leader sequence motif, as well as
various PCR-based or low stringency hybridization cloning techniques, have advanced the state
of the art by making available large numbers of DNA/amino acid sequences for proteins that
are known to have biological activity by virtue of their secreted nature in the case of leader
sequence cloning, or by virtue of the cell or tissue source in the case of PCR-based techniques.
It is to these proteins that the present invention is directed.

SUMMARY OF THE INVENTION

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID NO:1;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID NO:1
from nucleotide 28 to nucleotide 276;

(c) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone AE402_1i deposited under accession number ATCC
98190;

(d) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone AE402_1i deposited under accession number ATCC 98190;

(e) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone AE402_1i deposited under accession number ATCC
98190;

(f) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone AE402_1i deposited under accession number ATCC 98190;

(g) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:2;

(h) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:2 having biological activity;

(i) a polynucleotide which is an allelic variant of a polynucleotide of (a)-(f)
above; and

(j) a polynucleotide which encodes a species homologue of the protein
of (g) or (h) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:1
from nucleotide 28 to nucleotide 276; the nucleotide sequence of the full length protein coding
sequence of clone AE402_1i deposited under accession number ATCC 98190; or the
nucleotide sequence of the mature protein coding sequence of clone AE402_1i deposited under
accession number ATCC 98190. In other preferred embodiments, the polynucleotide encodes
the full length or mature protein encoded by the cDNA insert of clone AE402_1i deposited
under accession number ATCC 98190.
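
A nucleotide range such as "nucleotide 28 to nucleotide 276" can be checked for reading-frame consistency with a few lines of Python. The sketch below is illustrative only; the function name is hypothetical, and the text does not say whether the recited range includes a stop codon, so the codon count is an upper bound on the number of encoded residues.

```python
def coding_range_summary(first_nt, last_nt):
    """Length, whole codons, and leftover bases of a 1-based inclusive range."""
    if first_nt < 1 or last_nt < first_nt:
        raise ValueError("expected 1 <= first_nt <= last_nt")
    length = last_nt - first_nt + 1
    codons, remainder = divmod(length, 3)
    return length, codons, remainder


print(coding_range_summary(28, 276))  # (249, 83, 0)
```

A remainder of zero means the recited range is a whole number of codons, as would be expected for a protein coding region.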

In other embodiments, the present invention provides a composition comprising a
protein, wherein said protein comprises an amino acid sequence selected from the group
consisting of:

(a) the amino acid sequence of SEQ ID NO:2;

(b) fragments of the amino acid sequence of SEQ ID NO:2; and

(c) the amino acid sequence encoded by the cDNA insert of clone
AE402_1i deposited under accession number ATCC 98190;

the protein being substantially free from other mammalian proteins. Preferably such protein
comprises the amino acid sequence of SEQ ID NO:2.

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID NO:4;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:4 from nucleotide 61 to nucleotide 513;

(c) a polynucleotide comprising the nucleotide sequence of SEQ ID NO:4
from nucleotide 322 to nucleotide 513;

(d) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone AE610_1i deposited under accession number ATCC
98190;

(e) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone AE610_1i deposited under accession number ATCC 98190;

(f) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone AE610_1i deposited under accession number ATCC
98190;

(g) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone AE610_1i deposited under accession number ATCC 98190;

(h) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:5;

(i) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:5 having biological activity;

(j) a polynucleotide which is an allelic variant of a polynucleotide of (a)-(g)
above; and

(k) a polynucleotide which encodes a species homologue of the protein
of (h) or (i) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:4
from nucleotide 61 to nucleotide 513; the nucleotide sequence of SEQ ID NO:4 from
nucleotide 322 to nucleotide 513; the nucleotide sequence of the full length protein coding
sequence of clone AE610_1i deposited under accession number ATCC 98190; or the
nucleotide sequence of the mature protein coding sequence of clone AE610_1i deposited under
accession number ATCC 98190. In other preferred embodiments, the polynucleotide encodes
the full length or mature protein encoded by the cDNA insert of clone AE610_1i deposited
under accession number ATCC 98190.

In other embodiments, the present invention provides a composition comprising a
protein, wherein said protein comprises an amino acid sequence selected from the group
consisting of:

(a) the amino acid sequence of SEQ ID NO:5;

(b) fragments of the amino acid sequence of SEQ ID NO:5; and

(c) the amino acid sequence encoded by the cDNA insert of clone
AE610_1i deposited under accession number ATCC 98190;

the protein being substantially free from other mammalian proteins. Preferably such protein
comprises the amino acid sequence of SEQ ID NO:5.

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID NO:7;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID NO:7
from nucleotide 20 to nucleotide 523;

(c) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone AH106_1i deposited under accession number ATCC
98190;

(d) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone AH106_1i deposited under accession number ATCC 98190;

(e) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone AH106_1i deposited under accession number ATCC
98190;

(f) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone AH106_1i deposited under accession number ATCC 98190;

(g) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:8;

(h) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:8 having biological activity;

(i) a polynucleotide which is an allelic variant of a polynucleotide of (a)-(f)
above; and

(j) a polynucleotide which encodes a species homologue of the protein
of (g) or (h) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:7
from nucleotide 20 to nucleotide 523; the nucleotide sequence of the full length protein coding
sequence of clone AH106_1i deposited under accession number ATCC 98190; or the
nucleotide sequence of the mature protein coding sequence of clone AH106_1i deposited
under accession number ATCC 98190. In other preferred embodiments, the polynucleotide
encodes the full length or mature protein encoded by the cDNA insert of clone AH106_1i
deposited under accession number ATCC 98190.

In other embodiments, the present invention provides a composition comprising a
protein, wherein said protein comprises an amino acid sequence selected from the group
consisting of:

(a) the amino acid sequence of SEQ ID NO:8;

(b) fragments of the amino acid sequence of SEQ ID NO:8; and

(c) the amino acid sequence encoded by the cDNA insert of clone
AH106_1i deposited under accession number ATCC 98190;

the protein being substantially free from other mammalian proteins. Preferably such protein
comprises the amino acid sequence of SEQ ID NO:8.

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID NO:9;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID NO:9
from nucleotide 130 to nucleotide 309;

(c) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone AH196_1i deposited under accession number ATCC
98190;

(d) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone AH196_1i deposited under accession number ATCC 98190;

(e) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone AH196_1i deposited under accession number ATCC
98190;

(f) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone AH196_1i deposited under accession number ATCC 98190;

(g) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:10;

(h) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:10 having biological activity;

(i) a polynucleotide which is an allelic variant of a polynucleotide of (a)-(f)
above; and

(j) a polynucleotide which encodes a species homologue of the protein
of (g) or (h) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:9
from nucleotide 130 to nucleotide 309; the nucleotide sequence of the full length protein
coding sequence of clone AH196_1i deposited under accession number ATCC 98190; or the
nucleotide sequence of the mature protein coding sequence of clone AH196_1i deposited
under accession number ATCC 98190. In other preferred embodiments, the polynucleotide
encodes the full length or mature protein encoded by the cDNA insert of clone AH196_1i
deposited under accession number ATCC 98190.

In other embodiments, the present invention provides a composition comprising a
protein, wherein said protein comprises an amino acid sequence selected from the group
consisting of:

(a) the amino acid sequence of SEQ ID NO:10;

(b) fragments of the amino acid sequence of SEQ ID NO:10; and

(c) the amino acid sequence encoded by the cDNA insert of clone
AH196_1i deposited under accession number ATCC 98190;

the protein being substantially free from other mammalian proteins. Preferably such protein
comprises the amino acid sequence of SEQ ID NO:10.

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID NO:12;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:12 from nucleotide 69 to nucleotide 467;

(c) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone AI6_1i deposited under accession number ATCC
98190;

(d) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone AI6_1i deposited under accession number ATCC 98190;

(e) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone AI6_1i deposited under accession number ATCC
98190;

(f) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone AI6_1i deposited under accession number ATCC 98190;

(g) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:13;

(h) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:13 having biological activity;

(i) a polynucleotide which is an allelic variant of a polynucleotide of (a)-(f)
above; and

(j) a polynucleotide which encodes a species homologue of the protein
of (g) or (h) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:12
from nucleotide 69 to nucleotide 467; the nucleotide sequence of the full length protein coding
sequence of clone AI6_1i deposited under accession number ATCC 98190; or the nucleotide
sequence of the mature protein coding sequence of clone AI6_1i deposited under accession
number ATCC 98190. In other preferred embodiments, the polynucleotide encodes the full
length or mature protein encoded by the cDNA insert of clone AI6_1i deposited under
accession number ATCC 98190. In yet other preferred embodiments, such polynucleotide
encodes a protein comprising the amino acid sequence of SEQ ID NO:13 from amino acid 69
to amino acid 133.
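
The relationship between the recited nucleotide range (69 to 467) and the recited residue range (amino acid 69 to amino acid 133) can be made explicit with a short Python sketch. It assumes, as an inference from the text rather than a stated fact, that amino acid 1 of SEQ ID NO:13 is encoded by the codon beginning at nucleotide 69 of SEQ ID NO:12; the function name is hypothetical.

```python
def residue_range_to_nucleotides(orf_start_nt, first_aa, last_aa):
    """Map a 1-based residue range to 1-based inclusive nucleotide coordinates,
    assuming residue 1 is encoded by the codon starting at `orf_start_nt`."""
    if first_aa < 1 or last_aa < first_aa:
        raise ValueError("expected 1 <= first_aa <= last_aa")
    start = orf_start_nt + 3 * (first_aa - 1)
    end = orf_start_nt + 3 * last_aa - 1
    return start, end


print(residue_range_to_nucleotides(69, 1, 133))   # (69, 467)
print(residue_range_to_nucleotides(69, 69, 133))  # (273, 467)
```

Under that assumption the 133 residues account exactly for nucleotides 69 to 467, and the fragment from amino acid 69 to amino acid 133 would correspond to nucleotides 273 to 467.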

In other embodiments, the present invention provides a composition comprising a
protein, wherein said protein comprises an amino acid sequence selected from the group
consisting of:

(a) the amino acid sequence of SEQ ID NO:13;

(b) the amino acid sequence of SEQ ID NO:13 from amino acid 69 to
amino acid 133;

(c) fragments of the amino acid sequence of SEQ ID NO:13; and

(d) the amino acid sequence encoded by the cDNA insert of clone AI6_1i
deposited under accession number ATCC 98190;

the protein being substantially free from other mammalian proteins. Preferably such protein
comprises the amino acid sequence of SEQ ID NO:13 or the amino acid sequence of SEQ ID
NO:13 from amino acid 69 to amino acid 133.

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:16;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:16 from nucleotide 55 to nucleotide 337;

(c) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone AJ13_1i deposited under accession number ATCC
98190;

(d) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone AJ13_1i deposited under accession number ATCC 98190;

(e) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone AJ13_1i deposited under accession number ATCC
98190;

(f) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone AJ13_1i deposited under accession number ATCC 98190;

(g) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:17;

(h) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:17 having biological activity;

(i) a polynucleotide which is an allelic variant of a polynucleotide of (a)-
(f) above; and

(j) a polynucleotide which encodes a species homologue of the protein
of (g) or (h) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:16
from nucleotide 55 to nucleotide 337; the nucleotide sequence of the full length protein coding
sequence of clone AJ13_1i deposited under accession number ATCC 98190; or the nucleotide
sequence of the mature protein coding sequence of clone AJ13_1i deposited under accession
number ATCC 98190. In other preferred embodiments, the polynucleotide encodes the full
length or mature protein encoded by the cDNA insert of clone AJ13_1i deposited under
accession number ATCC 98190. In yet other preferred embodiments, such polynucleotide
encodes a protein comprising the amino acid sequence of SEQ ID NO:17 from amino acid 12
to amino acid 94.

In other embodiments, the present invention provides a composition comprising a
protein, wherein said protein comprises an amino acid sequence selected from the group
consisting of:

(a) the amino acid sequence of SEQ ID NO:17;

(b) the amino acid sequence of SEQ ID NO:17 from amino acid 12 to
amino acid 94;

(c) fragments of the amino acid sequence of SEQ ID NO:17; and

(d) the amino acid sequence encoded by the cDNA insert of clone
AJ13_1i deposited under accession number ATCC 98190;
the protein being substantially free from other mammalian proteins. Preferably such protein
comprises the amino acid sequence of SEQ ID NO:17 or the amino acid sequence of SEQ ID
NO:17 from amino acid 12 to amino acid 94.

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:19;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:19 from nucleotide 33 to nucleotide 422;

(c) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:19 from nucleotide 114 to nucleotide 422;

(d) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone AJ27_1i deposited under accession number ATCC
98190;

(e) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone AJ27_1i deposited under accession number ATCC 98190;

(f) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone AJ27_1i deposited under accession number ATCC
98190;

(g) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone AJ27_1i deposited under accession number ATCC 98190;

(h) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:20;

(i) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:20 having biological activity;

(j) a polynucleotide which is an allelic variant of a polynucleotide of (a)-
(g) above; and

(k) a polynucleotide which encodes a species homologue of the protein
of (h) or (i) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:19
from nucleotide 33 to nucleotide 422; the nucleotide sequence of SEQ ID NO:19 from
nucleotide 114 to nucleotide 422; the nucleotide sequence of the full length protein coding
sequence of clone AJ27_1i deposited under accession number ATCC 98190; or the nucleotide
sequence of the mature protein coding sequence of clone AJ27_1i deposited under accession
number ATCC 98190. In other preferred embodiments, the polynucleotide encodes the full
length or mature protein encoded by the cDNA insert of clone AJ27_1i deposited under
accession number ATCC 98190.

In other embodiments, the present invention provides a composition comprising a
protein, wherein said protein comprises an amino acid sequence selected from the group
consisting of:

(a) the amino acid sequence of SEQ ID NO:20;

(b) fragments of the amino acid sequence of SEQ ID NO:20; and

(c) the amino acid sequence encoded by the cDNA insert of clone
AJ27_1i deposited under accession number ATCC 98190;

the protein being substantially free from other mammalian proteins. Preferably such protein
comprises the amino acid sequence of SEQ ID NO:20.

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:22;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:22 from nucleotide 47 to nucleotide 517;

(c) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:22 from nucleotide 116 to nucleotide 517;

(d) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone AJ142_1i deposited under accession number ATCC
98190;

(e) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone AJ142_1i deposited under accession number ATCC 98190;

(f) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone AJ142_1i deposited under accession number ATCC
98190;

(g) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone AJ142_1i deposited under accession number ATCC 98190;

(h) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:23;

(i) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:23 having biological activity;

(j) a polynucleotide which is an allelic variant of a polynucleotide of (a)-
(g) above; and

(k) a polynucleotide which encodes a species homologue of the protein
of (h) or (i) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:22
from nucleotide 47 to nucleotide 517; the nucleotide sequence of SEQ ID NO:22 from
nucleotide 116 to nucleotide 517; the nucleotide sequence of the full length protein coding
sequence of clone AJ142_1i deposited under accession number ATCC 98190; or the
nucleotide sequence of the mature protein coding sequence of clone AJ142_1i deposited under
accession number ATCC 98190. In other preferred embodiments, the polynucleotide encodes
the full length or mature protein encoded by the cDNA insert of clone AJ142_1i deposited
under accession number ATCC 98190.

In other embodiments, the present invention provides a composition comprising a
protein, wherein said protein comprises an amino acid sequence selected from the group
consisting of:

(a) the amino acid sequence of SEQ ID NO:23;

(b) fragments of the amino acid sequence of SEQ ID NO:23; and

(c) the amino acid sequence encoded by the cDNA insert of clone
AJ142_1i deposited under accession number ATCC 98190;
the protein being substantially free from other mammalian proteins. Preferably such protein
comprises the amino acid sequence of SEQ ID NO:23.

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:24;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:24 from nucleotide 312 to nucleotide 417;

(c) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone AK604_1i deposited under accession number ATCC
98190;

(d) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone AK604_1i deposited under accession number ATCC 98190;

(e) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone AK604_1i deposited under accession number ATCC
98190;

(f) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone AK604_1i deposited under accession number ATCC 98190;

(g) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:25;

(h) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:25 having biological activity;

(i) a polynucleotide which is an allelic variant of a polynucleotide of (a)-
(f) above; and

(j) a polynucleotide which encodes a species homologue of the protein
of (g) or (h) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:24
from nucleotide 312 to nucleotide 417; the nucleotide sequence of the full length protein
coding sequence of clone AK604_1i deposited under accession number ATCC 98190; or the
nucleotide sequence of the mature protein coding sequence of clone AK604_1i deposited
under accession number ATCC 98190. In other preferred embodiments, the polynucleotide
encodes the full length or mature protein encoded by the cDNA insert of clone AK604_1i
deposited under accession number ATCC 98190.

In other embodiments, the present invention provides a composition comprising a 
protein, wherein said protein comprises an amino acid sequence selected from the group 
consisting of: 

(a) the amino acid sequence of SEQ ID NO:25; 

(b) fragments of the amino acid sequence of SEQ ID NO:25; and 

(c) the amino acid sequence encoded by the cDNA insert of clone
AK604_1i deposited under accession number ATCC 98190;

the protein being substantially free from other mammalian proteins. Preferably such protein 
comprises the amino acid sequence of SEQ ID NO:25. 

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:27;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:27 from nucleotide 76 to nucleotide 372;

(c) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone AK620_1i deposited under accession number ATCC
98190;

(d) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone AK620_1i deposited under accession number ATCC 98190;

(e) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone AK620_1i deposited under accession number ATCC
98190;

(f) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone AK620_1i deposited under accession number ATCC 98190;

(g) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:28;

(h) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:28 having biological activity;

(i) a polynucleotide which is an allelic variant of a polynucleotide of (a)-
(f) above; and

(j) a polynucleotide which encodes a species homologue of the protein
of (g) or (h) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:27
from nucleotide 76 to nucleotide 372; the nucleotide sequence of the full length protein coding
sequence of clone AK620_1i deposited under accession number ATCC 98190; or the
nucleotide sequence of the mature protein coding sequence of clone AK620_1i deposited
under accession number ATCC 98190. In other preferred embodiments, the polynucleotide
encodes the full length or mature protein encoded by the cDNA insert of clone AK620_1i
deposited under accession number ATCC 98190.

In other embodiments, the present invention provides a composition comprising a 
protein, wherein said protein comprises an amino acid sequence selected from the group 
consisting of: 

(a) the amino acid sequence of SEQ ID NO:28;

(b) fragments of the amino acid sequence of SEQ ID NO:28; and

(c) the amino acid sequence encoded by the cDNA insert of clone
AK620_1i deposited under accession number ATCC 98190;
the protein being substantially free from other mammalian proteins. Preferably such protein 
comprises the amino acid sequence of SEQ ID NO:28. 

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:29;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:29 from nucleotide 367 to nucleotide 552;

(c) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone AK650_1i deposited under accession number ATCC
98190;

(d) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone AK650_1i deposited under accession number ATCC 98190;

(e) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone AK650_1i deposited under accession number ATCC
98190;

(f) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone AK650_1i deposited under accession number ATCC 98190;

(g) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:30;

(h) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:30 having biological activity;

(i) a polynucleotide which is an allelic variant of a polynucleotide of (a)-
(f) above; and

(j) a polynucleotide which encodes a species homologue of the protein
of (g) or (h) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:29
from nucleotide 367 to nucleotide 552; the nucleotide sequence of the full length protein
coding sequence of clone AK650_1i deposited under accession number ATCC 98190; or the
nucleotide sequence of the mature protein coding sequence of clone AK650_1i deposited
under accession number ATCC 98190. In other preferred embodiments, the polynucleotide
encodes the full length or mature protein encoded by the cDNA insert of clone AK650_1i
deposited under accession number ATCC 98190.

In other embodiments, the present invention provides a composition comprising a
protein, wherein said protein comprises an amino acid sequence selected from the group
consisting of:

(a) the amino acid sequence of SEQ ID NO:30;

(b) fragments of the amino acid sequence of SEQ ID NO:30; and

(c) the amino acid sequence encoded by the cDNA insert of clone
AK650_1i deposited under accession number ATCC 98190;

the protein being substantially free from other mammalian proteins. Preferably such protein
comprises the amino acid sequence of SEQ ID NO:30.

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:32;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:32 from nucleotide 116 to nucleotide 310;

(c) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:32 from nucleotide 173 to nucleotide 310;

(d) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone AM226_1i deposited under accession number
ATCC 98190;

(e) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone AM226_1i deposited under accession number ATCC 98190;

(f) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone AM226_1i deposited under accession number
ATCC 98190;

(g) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone AM226_1i deposited under accession number ATCC 98190;

(h) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:33;

(i) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:33 having biological activity;

(j) a polynucleotide which is an allelic variant of a polynucleotide of (a)-
(g) above; and

(k) a polynucleotide which encodes a species homologue of the protein
of (h) or (i) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:32
from nucleotide 116 to nucleotide 310; the nucleotide sequence of SEQ ID NO:32 from
nucleotide 173 to nucleotide 310; the nucleotide sequence of the full length protein coding
sequence of clone AM226_1i deposited under accession number ATCC 98190; or the
nucleotide sequence of the mature protein coding sequence of clone AM226_1i deposited
under accession number ATCC 98190. In other preferred embodiments, the polynucleotide
encodes the full length or mature protein encoded by the cDNA insert of clone AM226_1i
deposited under accession number ATCC 98190.

In other embodiments, the present invention provides a composition comprising a
protein, wherein said protein comprises an amino acid sequence selected from the group
consisting of:

(a) the amino acid sequence of SEQ ID NO:33;

(b) fragments of the amino acid sequence of SEQ ID NO:33; and

(c) the amino acid sequence encoded by the cDNA insert of clone
AM226_1i deposited under accession number ATCC 98190;
the protein being substantially free from other mammalian proteins. Preferably such protein
comprises the amino acid sequence of SEQ ID NO:33.

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:35;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:35 from nucleotide 281 to nucleotide 418;

(c) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:35 from nucleotide 353 to nucleotide 418;

(d) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone AR417_1i deposited under accession number ATCC
98190;

(e) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone AR417_1i deposited under accession number ATCC 98190;

(f) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone AR417_1i deposited under accession number ATCC
98190;

(g) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone AR417_1i deposited under accession number ATCC 98190;

(h) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:36;

(i) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:36 having biological activity;

(j) a polynucleotide which is an allelic variant of a polynucleotide of (a)-
(g) above; and

(k) a polynucleotide which encodes a species homologue of the protein
of (h) or (i) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:35
from nucleotide 281 to nucleotide 418; the nucleotide sequence of SEQ ID NO:35 from
nucleotide 353 to nucleotide 418; the nucleotide sequence of the full length protein coding
sequence of clone AR417_1i deposited under accession number ATCC 98190; or the
nucleotide sequence of the mature protein coding sequence of clone AR417_1i deposited under
accession number ATCC 98190. In other preferred embodiments, the polynucleotide encodes
the full length or mature protein encoded by the cDNA insert of clone AR417_1i deposited
under accession number ATCC 98190.

In other embodiments, the present invention provides a composition comprising a
protein, wherein said protein comprises an amino acid sequence selected from the group
consisting of:

(a) the amino acid sequence of SEQ ID NO:36;

(b) fragments of the amino acid sequence of SEQ ID NO:36; and

(c) the amino acid sequence encoded by the cDNA insert of clone
AR417_1i deposited under accession number ATCC 98190;

the protein being substantially free from other mammalian proteins. Preferably such protein
comprises the amino acid sequence of SEQ ID NO:36.

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:38;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:38 from nucleotide 496 to nucleotide 583;

(c) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:38 from nucleotide 565 to nucleotide 583;

(d) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone AU43_1i deposited under accession number ATCC
98190;

(e) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone AU43_1i deposited under accession number ATCC 98190;

(f) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone AU43_1i deposited under accession number ATCC
98190;

(g) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone AU43_1i deposited under accession number ATCC 98190;

(h) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:39;

(i) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:39 having biological activity;

(j) a polynucleotide which is an allelic variant of a polynucleotide of (a)-
(g) above; and

(k) a polynucleotide which encodes a species homologue of the protein
of (h) or (i) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:38
from nucleotide 496 to nucleotide 583; the nucleotide sequence of SEQ ID NO:38 from
nucleotide 565 to nucleotide 583; the nucleotide sequence of the full length protein coding
sequence of clone AU43_1i deposited under accession number ATCC 98190; or the nucleotide
sequence of the mature protein coding sequence of clone AU43_1i deposited under accession
number ATCC 98190. In other preferred embodiments, the polynucleotide encodes the full
length or mature protein encoded by the cDNA insert of clone AU43_1i deposited under
accession number ATCC 98190.

In other embodiments, the present invention provides a composition comprising a 
protein, wherein said protein comprises an amino acid sequence selected from the group 
consisting of: 

(a) the amino acid sequence of SEQ ID NO:39; 

(b) fragments of the amino acid sequence of SEQ ID NO:39; and 

(c) the amino acid sequence encoded by the cDNA insert of clone
AU43_1i deposited under accession number ATCC 98190;

the protein being substantially free from other mammalian proteins. Preferably such protein
comprises the amino acid sequence of SEQ ID NO:39.

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:41;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:41 from nucleotide 55 to nucleotide 405;

(c) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:41 from nucleotide 148 to nucleotide 405;

(d) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone AW60_1i deposited under accession number ATCC
98190;

(e) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone AW60_1i deposited under accession number ATCC 98190;

(f) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone AW60_1i deposited under accession number ATCC
98190;

(g) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone AW60_1i deposited under accession number ATCC 98190;

(h) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:42;

(i) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:42 having biological activity;

(j) a polynucleotide which is an allelic variant of a polynucleotide of (a)-
(g) above; and

(k) a polynucleotide which encodes a species homologue of the protein
of (h) or (i) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:41
from nucleotide 55 to nucleotide 405; the nucleotide sequence of SEQ ID NO:41 from
nucleotide 148 to nucleotide 405; the nucleotide sequence of the full length protein coding
sequence of clone AW60_1i deposited under accession number ATCC 98190; or the
nucleotide sequence of the mature protein coding sequence of clone AW60_1i deposited under
accession number ATCC 98190. In other preferred embodiments, the polynucleotide encodes
the full length or mature protein encoded by the cDNA insert of clone AW60_1i deposited
under accession number ATCC 98190.

In other embodiments, the present invention provides a composition comprising a 
protein, wherein said protein comprises an amino acid sequence selected from the group 
consisting of: 

(a) the amino acid sequence of SEQ ID NO:42;

(b) fragments of the amino acid sequence of SEQ ID NO:42; and

(c) the amino acid sequence encoded by the cDNA insert of clone
AW60_1i deposited under accession number ATCC 98190;

the protein being substantially free from other mammalian proteins. Preferably such protein
comprises the amino acid sequence of SEQ ID NO:42.

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:44;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:44 from nucleotide 337 to nucleotide 525;

(c) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:44 from nucleotide 406 to nucleotide 525;

(d) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone BA176_1i deposited under accession number ATCC
98190;

(e) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone BA176_1i deposited under accession number ATCC 98190;

(f) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone BA176_1i deposited under accession number ATCC
98190;

(g) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone BA176_1i deposited under accession number ATCC 98190;

(h) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:45;

(i) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:45 having biological activity;

(j) a polynucleotide which is an allelic variant of a polynucleotide of (a)-
(g) above; and

(k) a polynucleotide which encodes a species homologue of the protein
of (h) or (i) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:44
from nucleotide 337 to nucleotide 525; the nucleotide sequence of SEQ ID NO:44 from
nucleotide 406 to nucleotide 525; the nucleotide sequence of the full length protein coding
sequence of clone BA176_1i deposited under accession number ATCC 98190; or the
nucleotide sequence of the mature protein coding sequence of clone BA176_1i deposited under
accession number ATCC 98190. In other preferred embodiments, the polynucleotide encodes
the full length or mature protein encoded by the cDNA insert of clone BA176_1i deposited
under accession number ATCC 98190.

In other embodiments, the present invention provides a composition comprising a
protein, wherein said protein comprises an amino acid sequence selected from the group
consisting of:

(a) the amino acid sequence of SEQ ID NO:45;

(b) fragments of the amino acid sequence of SEQ ID NO:45; and

(c) the amino acid sequence encoded by the cDNA insert of clone
BA176_1i deposited under accession number ATCC 98190;
the protein being substantially free from other mammalian proteins. Preferably such protein
comprises the amino acid sequence of SEQ ID NO:45.

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:47;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:47 from nucleotide 536 to nucleotide 628;

(c) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone BD140_1i deposited under accession number ATCC
98190;

(d) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone BD140_1i deposited under accession number ATCC 98190;

(e) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone BD140_1i deposited under accession number ATCC
98190;

(f) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone BD140_1i deposited under accession number ATCC 98190;

(g) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:48;

(h) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:48 having biological activity;

(i) a polynucleotide which is an allelic variant of a polynucleotide of (a)-
(f) above; and

(j) a polynucleotide which encodes a species homologue of the protein
of (g) or (h) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:47
from nucleotide 536 to nucleotide 628; the nucleotide sequence of the full length protein
coding sequence of clone BD140_li deposited under accession number ATCC 98190; or the
nucleotide sequence of the mature protein coding sequence of clone BD140_li deposited under
accession number ATCC 98190. In other preferred embodiments, the polynucleotide encodes
the full length or mature protein encoded by the cDNA insert of clone BD140_li deposited
under accession number ATCC 98190.

In other embodiments, the present invention provides a composition comprising a
protein, wherein said protein comprises an amino acid sequence selected from the group
consisting of:

(a) the amino acid sequence of SEQ ID NO:48;

(b) fragments of the amino acid sequence of SEQ ID NO:48; and

(c) the amino acid sequence encoded by the cDNA insert of clone
BD140_li deposited under accession number ATCC 98190;

the protein being substantially free from other mammalian proteins. Preferably such protein
comprises the amino acid sequence of SEQ ID NO:48.
In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:50;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:50 from nucleotide 303 to nucleotide 617;

(c) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:50 from nucleotide 345 to nucleotide 617;

(d) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone BD407_li deposited under accession number ATCC
98190;

(e) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone BD407_li deposited under accession number ATCC 98190;

(f) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone BD407_li deposited under accession number ATCC
98190;

(g) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone BD407_li deposited under accession number ATCC 98190;

(h) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:51;



WO 98/14470 PCT/US97/18032

(i) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:51 having biological activity;

(j) a polynucleotide which is an allelic variant of a polynucleotide of (a)-(g)
above; and

(k) a polynucleotide which encodes a species homologue of the protein
of (h) or (i) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:50
from nucleotide 303 to nucleotide 617; the nucleotide sequence of SEQ ID NO:50 from
nucleotide 345 to nucleotide 617; the nucleotide sequence of the full length protein coding
sequence of clone BD407_li deposited under accession number ATCC 98190; or the
nucleotide sequence of the mature protein coding sequence of clone BD407_li deposited under
accession number ATCC 98190. In other preferred embodiments, the polynucleotide encodes
the full length or mature protein encoded by the cDNA insert of clone BD407_li deposited
under accession number ATCC 98190.

In other embodiments, the present invention provides a composition comprising a
protein, wherein said protein comprises an amino acid sequence selected from the group
consisting of:

(a) the amino acid sequence of SEQ ID NO:51;

(b) the amino acid sequence of SEQ ID NO:51 from amino acid 1 to
amino acid 32;

(c) fragments of the amino acid sequence of SEQ ID NO:51; and

(d) the amino acid sequence encoded by the cDNA insert of clone
BD407_li deposited under accession number ATCC 98190;

the protein being substantially free from other mammalian proteins. Preferably such protein
comprises the amino acid sequence of SEQ ID NO:51 or the amino acid sequence of SEQ ID
NO:51 from amino acid 1 to amino acid 32.

In one embodiment, the present invention provides a composition comprising an
isolated protein encoded by a polynucleotide selected from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:52;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID
NO:52 from nucleotide 178 to nucleotide 534;

(c) a polynucleotide comprising the nucleotide sequence of the full length
protein coding sequence of clone BF290_li deposited under accession number ATCC
98190;

(d) a polynucleotide encoding the full length protein encoded by the
cDNA insert of clone BF290_li deposited under accession number ATCC 98190;

(e) a polynucleotide comprising the nucleotide sequence of the mature
protein coding sequence of clone BF290_li deposited under accession number ATCC
98190;

(f) a polynucleotide encoding the mature protein encoded by the cDNA
insert of clone BF290_li deposited under accession number ATCC 98190;

(g) a polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:53;

(h) a polynucleotide encoding a protein comprising a fragment of the
amino acid sequence of SEQ ID NO:53 having biological activity;

(i) a polynucleotide which is an allelic variant of a polynucleotide of (a)-(f)
above; and

(j) a polynucleotide which encodes a species homologue of the protein
of (g) or (h) above.

Preferably, such polynucleotide comprises the nucleotide sequence of SEQ ID NO:52
from nucleotide 178 to nucleotide 534; the nucleotide sequence of the full length protein
coding sequence of clone BF290_li deposited under accession number ATCC 98190; or the
nucleotide sequence of the mature protein coding sequence of clone BF290_li deposited under
accession number ATCC 98190. In other preferred embodiments, the polynucleotide encodes
the full length or mature protein encoded by the cDNA insert of clone BF290_li deposited
under accession number ATCC 98190.

In other embodiments, the present invention provides a composition comprising a
protein, wherein said protein comprises an amino acid sequence selected from the group
consisting of:

(a) the amino acid sequence of SEQ ID NO:53;

(b) fragments of the amino acid sequence of SEQ ID NO:53; and

(c) the amino acid sequence encoded by the cDNA insert of clone
BF290_li deposited under accession number ATCC 98190;

the protein being substantially free from other mammalian proteins. Preferably such protein
comprises the amino acid sequence of SEQ ID NO:53.

Protein compositions of the present invention may further comprise a pharmaceutically
acceptable carrier. Compositions comprising an antibody which specifically reacts with such
protein are also provided by the present invention.

Methods are also provided for preventing, treating or ameliorating a medical condition
which comprises administering to a mammalian subject a therapeutically effective amount of
a composition comprising a protein of the present invention and a pharmaceutically acceptable
carrier.

BRIEF DESCRIPTION OF FIGURES
Fig. 1 is a schematic representation of the pED6 and pNotS vectors used for deposit
of clones disclosed herein.

Fig. 2 is an autoradiograph evidencing the expression of the following clone(s)
disclosed herein: AE610_li.

Fig. 3 is an autoradiograph evidencing the expression of the following clone(s)
disclosed herein: AH106_li, AM226_li.

Fig. 4 is an autoradiograph evidencing the expression of the following clone(s)
disclosed herein: AH196_li.

Fig. 5 is an autoradiograph evidencing the expression of the following clone(s)
disclosed herein: AI6_li.

Fig. 6 is an autoradiograph evidencing the expression of the following clone(s)
disclosed herein: AR417_li.

Fig. 7 is an autoradiograph evidencing the expression of the following clone(s)
disclosed herein: AW60_li.

Fig. 8 is an autoradiograph evidencing the expression of the following clone(s)
disclosed herein: BD140_li.

Fig. 9 is an autoradiograph evidencing the expression of the following clone(s)
disclosed herein: BF290_li.

DETAILED DESCRIPTION

ISOLATED PROTEINS

Nucleotide and amino acid sequences are reported below for each clone and protein
disclosed in the present application. In some instances the sequences are preliminary and may
include some incorrect or ambiguous bases or amino acids. The actual nucleotide sequence
of each clone can readily be determined by sequencing of the deposited clone in accordance
with known methods. The predicted amino acid sequence (both full length and mature) can
then be determined from such nucleotide sequence. The amino acid sequence of the protein
encoded by a particular clone can also be determined by expression of the clone in a suitable
host cell, collecting the protein and determining its sequence.

For each disclosed protein applicants have identified what they have determined to be
the reading frame best identifiable with sequence information available at the time of filing.
Because of the partial ambiguity in reported sequence information, reported protein sequences
include "Xaa" designators. These "Xaa" designators indicate either (1) a residue which cannot
be identified because of nucleotide sequence ambiguity or (2) a stop codon in the determined
nucleotide sequence where applicants believe one should not exist (if the nucleotide sequence
were determined more accurately).
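
The "Xaa" convention described above can be illustrated with a short Python sketch. It is not applicants' procedure. The function name `translate_with_xaa` is hypothetical, and two rules are assumptions made only for illustration: any codon containing a base other than A, C, G or T is reported as "Xaa", and any stop codon other than the final codon is treated as an unexpected stop and also reported as "Xaa". The codon table is the standard genetic code.

```python
# Standard genetic code, enumerated in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# One-letter to three-letter amino acid codes.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def translate_with_xaa(seq):
    """Translate a DNA reading frame into three-letter residues.

    Assumed rules (illustrative only): ambiguous codons and internal
    stop codons both become "Xaa"; a stop in the final codon ends the
    protein; trailing bases that do not fill a codon are ignored.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    residues = []
    for index, codon in enumerate(codons):
        amino = CODON_TABLE.get(codon)  # None if any base is ambiguous
        if amino is None:
            residues.append("Xaa")
        elif amino == "*":
            if index == len(codons) - 1:
                break  # expected terminal stop
            residues.append("Xaa")  # stop where none is expected
        else:
            residues.append(THREE_LETTER[amino])
    return residues


# Toy sequence (hypothetical, not taken from any SEQ ID NO):
# ATG | NCT (ambiguous) | TAA (internal stop) | GGA | TGA (terminal stop)
print(translate_with_xaa("ATGNCTTAAGGATGA"))
```

In this toy input the second codon contains an ambiguous base and the third is an internal stop, so both positions are reported as "Xaa", matching cases (1) and (2) above.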

As used herein a "secreted" protein is one which, when expressed in a suitable host
cell, is transported across or through a membrane, including transport as a result of signal
sequences in its amino acid sequence. "Secreted" proteins include without limitation proteins
secreted wholly (e.g., soluble proteins) or partially (e.g., receptors) from the cell in which they
are expressed. "Secreted" proteins also include without limitation proteins which are
transported across the membrane of the endoplasmic reticulum.

Protein "AE402_li"

One protein of the present invention has been identified as protein "AE402_li". A
partial cDNA clone encoding AE402_li was first isolated from a murine adult spleen cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTA/BLASTX and FASTA search protocols. The search revealed at least
some identity with ESTs reported by the I.M.A.G.E. Consortium identified as "yh02h12.r1
Homo sapiens cDNA clone 42238 5'" (R60758, BlastN) and "yh02h12.s1 Homo sapiens
cDNA clone 42238 3'" (R60759, BlastN). The human cDNA clone corresponding to the EST
database entry was ordered from Genome Systems, Inc., St. Louis, Mo., a distributor of the
I.M.A.G.E. Consortium library. The clone received from the distributor was examined and
determined to be a full length clone, including a 5' end and 3' UTR (including a polyA tail).
This full-length clone is also referred to herein as "AE402_li".

Applicants' methods identified clone AE402_li as encoding a secreted protein.

The nucleotide sequence of the 5' portion of AE402_li as presently determined is
reported in SEQ ID NO:1. What applicants believe is the proper reading frame and the
predicted amino acid sequence of the AE402_li protein corresponding to the foregoing
nucleotide sequence is reported in SEQ ID NO:2. Additional nucleotide sequence from the
3' portion of AE402_li, including the polyA tail, is reported in SEQ ID NO:3.

Protein "AE610_li"

One protein of the present invention has been identified as protein "AE610_li". A
partial cDNA clone encoding AE610_li was first isolated from a murine adult spleen cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTA/BLASTX and FASTA search protocols. The search revealed at least
some identity with ESTs reported by the I.M.A.G.E. Consortium identified as "yf19g02.r1
Homo sapiens cDNA" (R08399, Fasta), "yw68d09.s1 Homo sapiens cDNA clone 257393 3'"
(N27174, BlastN), "yi10a04.r1 Homo sapiens cDNA" (R62698, Fasta) and "yhVSel 0.S1 Homo
sapiens cDNA clone 135882 3'" (R33815, BlastN). The human cDNA clone corresponding
to the EST database entry was ordered from Genome Systems, Inc., St. Louis, Mo., a distributor
of the I.M.A.G.E. Consortium library. The clone received from the distributor was examined
and determined to be a full length clone, including a 5' end and 3' UTR (including a polyA
tail). This full-length clone is also referred to herein as "AE610_li".

Applicants' methods identified clone AE610_li as encoding a secreted protein.

The nucleotide sequence of the 5' portion of AE610_li as presently determined is
reported in SEQ ID NO:4. What applicants believe is the proper reading frame and the
predicted amino acid sequence of the AE610_li protein corresponding to the foregoing
nucleotide sequence is reported in SEQ ID NO:5. Amino acids 1 to 87 are the predicted
leader/signal sequence, with the predicted mature amino acid sequence beginning at amino acid
88. Additional nucleotide sequence from the 3' portion of AE610_li, including the polyA tail,
is reported in SEQ ID NO:6.
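
The relationship between the predicted leader/signal sequence and the predicted mature sequence is a matter of 1-based residue numbering, which the following Python sketch makes explicit. The function name `split_leader` and the toy sequence are hypothetical and are not taken from any SEQ ID NO; the sketch only shows how "amino acids 1 to N are the leader, mature sequence begins at N+1" maps onto string slicing.

```python
def split_leader(protein, leader_end):
    """Split a protein at a 1-based, inclusive leader end position.

    Residues 1..leader_end form the predicted leader/signal sequence;
    the predicted mature sequence begins at residue leader_end + 1.
    """
    if not 0 <= leader_end <= len(protein):
        raise ValueError("leader_end must lie within the protein")
    leader = protein[:leader_end]   # residues 1..leader_end
    mature = protein[leader_end:]   # residues leader_end+1..end
    return leader, mature


# Toy 10-residue sequence (hypothetical): leader is residues 1 to 4,
# so the mature sequence begins at residue 5.
leader, mature = split_leader("MKLLVAGSTQ", 4)
print(leader, mature)
```

For a protein whose predicted leader spans amino acids 1 to 87, calling the function with `leader_end=87` would return a mature sequence beginning at amino acid 88.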



Protein "AH106_li"

One protein of the present invention has been identified as protein "AH106_li". A
partial cDNA clone encoding AH106_li was first isolated from a murine fetal thymus cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTA/BLASTX and FASTA search protocols. The search revealed at least
some identity with an EST reported by the I.M.A.G.E. Consortium identified at GenBank
accession number T81127. The human cDNA clone corresponding to the EST database entry
was ordered from Genome Systems, Inc., St. Louis, Mo., a distributor of the I.M.A.G.E.
Consortium library. The clone received from the distributor was examined and determined to
be a full length clone, including a 5' end and 3' UTR (including a polyA tail). This full-length
clone is also referred to herein as "AH106_li".

Applicants' methods identified clone AH106_li as encoding a secreted protein.

The nucleotide sequence of AH106_li as presently determined is reported in SEQ ID
NO:7. What applicants believe is the proper reading frame and the predicted amino acid
sequence of the AH106_li protein corresponding to the foregoing nucleotide sequence is
reported in SEQ ID NO:8.

Protein "AH196_li"

One protein of the present invention has been identified as protein "AH196_li". A
partial cDNA clone encoding AH196_li was first isolated from a murine fetal thymus cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTA/BLASTX and FASTA search protocols. The search revealed at least
some identity with ESTs reported by the I.M.A.G.E. Consortium identified as "yj12f04.r1
Homo sapiens cDNA clone 148543 5'" (H12523, BlastN) and "yj12f04.s1 Homo sapiens
cDNA clone 148543 3'" (H12470, BlastN). The human cDNA clone corresponding to the EST
database entry was ordered from Genome Systems, Inc., St. Louis, Mo., a distributor of the
I.M.A.G.E. Consortium library. The clone received from the distributor was examined and
determined to be a full length clone, including a 5' end and 3' UTR (including a polyA tail).
This full-length clone is also referred to herein as "AH196_li".

Applicants' methods identified clone AH196_li as encoding a secreted protein.

The nucleotide sequence of the 5' portion of AH196_li as presently determined is
reported in SEQ ID NO:9. What applicants believe is the proper reading frame and the
predicted amino acid sequence of the AH196_li protein corresponding to the foregoing
nucleotide sequence is reported in SEQ ID NO:10. Additional nucleotide sequence from the
3' portion of AH196_li, including the polyA tail, is reported in SEQ ID NO:11.

Protein "AI6_li"

One protein of the present invention has been identified as protein "AI6_li". A partial
cDNA clone encoding AI6_li was first isolated from a human blood cell (Th1 or Th2) cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTA/BLASTX and FASTA search protocols. The search revealed at least
some identity with ESTs reported by the I.M.A.G.E. Consortium identified as "yj42h04.r1
Homo sapiens cDNA" (H03613, Fasta) and "yx60f10.s1 Homo sapiens cDNA clone 266155
3'" (N21637, BlastN). The human cDNA clone corresponding to the EST database entry was
ordered from Genome Systems, Inc., St. Louis, Mo., a distributor of the I.M.A.G.E. Consortium
library. The clone received from the distributor was examined and determined to be a full
length clone, including a 5' end and 3' UTR (including a polyA tail). This full-length clone is
also referred to herein as "AI6_li".

Applicants' methods identified clone AI6_li as encoding a secreted protein.

The nucleotide sequence of the 5' portion of AI6_li as presently determined is reported
in SEQ ID NO:12. What applicants believe is the proper reading frame and the predicted
amino acid sequence of the AI6_li protein corresponding to the foregoing nucleotide sequence
is reported in SEQ ID NO:13. Additional nucleotide sequence from the 3' portion of AI6_li,
including the polyA tail, is reported in SEQ ID NO:14.

Protein "AJ13_li"

One protein of the present invention has been identified as protein "AJ13_li". A
partial cDNA clone encoding AJ13_li was first isolated from a human adult testes cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTA/BLASTX and FASTA search protocols. The search revealed at least
some identity with ESTs reported by the I.M.A.G.E. Consortium identified as "yo61h02.r1
Homo sapiens cDNA clone 182451 5'" (H42116, BlastN), "yr84a08.r1 Homo sapiens cDNA
clone 211958 5'" (H75363, BlastN) and "yg83h03.s1 Homo sapiens cDNA clone 40148 3'"
(R53978, BlastN). The human cDNA clone corresponding to the EST database entry was
ordered from Genome Systems, Inc., St. Louis, Mo., a distributor of the I.M.A.G.E. Consortium
library. The clone received from the distributor was examined and determined to be a full
length clone, including a 5' end and 3' UTR (including a polyA tail). This full-length clone is
also referred to herein as "AJ13_li".

Applicants' methods identified clone AJ13_li as encoding a secreted protein.

The nucleotide sequence of the 5' portion of AJ13_li as presently determined is
reported in SEQ ID NO:15. An additional internal nucleotide sequence from AJ13_li as
presently determined is reported in SEQ ID NO:16. What applicants believe is the proper
reading frame and the predicted amino acid sequence encoded by such internal sequence is
reported in SEQ ID NO:17. Additional nucleotide sequence from the 3' portion of AJ13_li,
including the polyA tail, is reported in SEQ ID NO:18.

Protein "AJ27_li"

One protein of the present invention has been identified as protein "AJ27_li". A
partial cDNA clone encoding AJ27_li was first isolated from a human adult testes cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTA/BLASTX and FASTA search protocols. The search revealed at least
some identity with ESTs reported by the I.M.A.G.E. Consortium identified as "yx25h01.r1
Homo sapiens cDNA clone 262897 5'" (N28373, BlastN) and "yx62d05.r1 Homo sapiens
cDNA clone 266313 5'" (N35654, BlastN). The human cDNA clone corresponding to the EST
database entry was ordered from Genome Systems, Inc., St. Louis, Mo., a distributor of the
I.M.A.G.E. Consortium library. The clone received from the distributor was examined and
determined to be a full length clone, including a 5' end and 3' UTR (including a polyA tail).
This full-length clone is also referred to herein as "AJ27_li".

Applicants' methods identified clone AJ27_li as encoding a secreted protein.

The nucleotide sequence of the 5' portion of AJ27_li as presently determined is
reported in SEQ ID NO:19. What applicants believe is the proper reading frame and the
predicted amino acid sequence of the AJ27_li protein corresponding to the foregoing
nucleotide sequence is reported in SEQ ID NO:20. Amino acids 1 to 27 are the predicted
leader/signal sequence, with the predicted mature amino acid sequence beginning at amino acid
28. Additional nucleotide sequence from the 3' portion of AJ27_li, including the polyA tail,
is reported in SEQ ID NO:21.



Protein "AJ142_li"

One protein of the present invention has been identified as protein "AJ142_li". A
partial cDNA clone encoding AJ142_li was first isolated from a human adult testes cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTA/BLASTX and FASTA search protocols. The search revealed at least
some identity with ESTs reported by the I.M.A.G.E. Consortium identified as "yq85b12.r1
Homo sapiens cDNA clone 202559 5'" (H53268, BlastN) and "yq85b12.s1 Homo sapiens
cDNA clone 202559 3'" (H53269, BlastN). The human cDNA clone corresponding to the EST
database entry was ordered from Genome Systems, Inc., St. Louis, Mo., a distributor of the
I.M.A.G.E. Consortium library. The clone received from the distributor was examined and
determined to be a full length clone, including a 5' end and 3' UTR (including a polyA tail).
This full-length clone is also referred to herein as "AJ142_li".

Applicants' methods identified clone AJ142_li as encoding a secreted protein.

The nucleotide sequence of AJ142_li as presently determined is reported in SEQ ID
NO:22. What applicants believe is the proper reading frame and the predicted amino acid
sequence of the AJ142_li protein corresponding to the foregoing nucleotide sequence is
reported in SEQ ID NO:23. Amino acids 1 to 23 are the predicted leader/signal sequence, with
the predicted mature amino acid sequence beginning at amino acid 24.

Protein "AK604_li"

One protein of the present invention has been identified as protein "AK604_li". A
partial cDNA clone encoding AK604_li was first isolated from a human fetal kidney cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTA/BLASTX and FASTA search protocols. The search revealed at least
some identity with an EST reported by the I.M.A.G.E. Consortium identified as "ycSOgll.rl
Homo sapiens cDNA clone 22157 5'" (T64857, BlastN). The sequence also showed at least
some identity with a partial cDNA sequence identified as "H. sapiens partial cDNA sequence;
clone c-1 pgl 1" (Z40033, BlastN). The human cDNA clone corresponding to the EST database
entry was ordered from Genome Systems, Inc., St. Louis, Mo., a distributor of the I.M.A.G.E.
Consortium library. The clone received from the distributor was examined and determined to
be a full length clone, including a 5' end and 3' UTR (including a polyA tail). This full-length
clone is also referred to herein as "AK604_li".

Applicants' methods identified clone AK604_li as encoding a secreted protein.

The nucleotide sequence of the 5' portion of AK604_li as presently determined is
reported in SEQ ID NO:24. What applicants believe is the proper reading frame and the
predicted amino acid sequence of the AK604_li protein corresponding to the foregoing
nucleotide sequence is reported in SEQ ID NO:25. Additional nucleotide sequence from the
3' portion of AK604_li, including the polyA tail, is reported in SEQ ID NO:26.

Protein "AK620_li"

One protein of the present invention has been identified as protein "AK620_li". A
partial cDNA clone encoding AK620_li was first isolated from a human fetal kidney cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTA/BLASTX and FASTA search protocols. The search revealed at least
some identity with ESTs reported by the I.M.A.G.E. Consortium identified as "ye7607.rl
Homo sapiens cDNA clone 123684 5'" (R02637, BlastN) and "yx90e05.s1 Homo sapiens
cDNA clone 269024 3'" (N26101, BlastN). The human cDNA clone corresponding to the EST
database entry was ordered from Genome Systems, Inc., St. Louis, Mo., a distributor of the
I.M.A.G.E. Consortium library. The clone received from the distributor was examined and
determined to be a full length clone, including a 5' end and 3' UTR (including a polyA tail).
This full-length clone is also referred to herein as "AK620_li".

Applicants' methods identified clone AK620_li as encoding a secreted protein.

The nucleotide sequence of AK620_li as presently determined is reported in SEQ ID
NO:27. What applicants believe is the proper reading frame and the predicted amino acid
sequence of the AK620_li protein corresponding to the foregoing nucleotide sequence is
reported in SEQ ID NO:28.



Protein "AK650_li"

One protein of the present invention has been identified as protein "AK650_li". A
partial cDNA clone encoding AK650_li was first isolated from a human fetal kidney cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTA/BLASTX and FASTA search protocols. The search revealed at least
some identity with ESTs reported by the I.M.A.G.E. Consortium identified as "yp60g06.r1
Homo sapiens cDNA clone 191866 5'" (H40407, BlastN) and "yp60g06.s1 Homo sapiens
cDNA clone 191866 3'" (H40350, BlastN). The human cDNA clone corresponding to the EST
database entry was ordered from Genome Systems, Inc., St. Louis, Mo., a distributor of the
I.M.A.G.E. Consortium library. The clone received from the distributor was examined and
determined to be a full length clone, including a 5' end and 3' UTR (including a polyA tail).
This full-length clone is also referred to herein as "AK650_li".

Applicants' methods identified clone AK650_li as encoding a secreted protein.

The nucleotide sequence of the 5' portion of AK650_li as presently determined is
reported in SEQ ID NO:29. What applicants believe is the proper reading frame and the
predicted amino acid sequence of the AK650_li protein corresponding to the foregoing
nucleotide sequence is reported in SEQ ID NO:30. Additional nucleotide sequence from the
3' portion of AK650_li, including the polyA tail, is reported in SEQ ID NO:31.

Protein "AM226_li"

One protein of the present invention has been identified as protein "AM226_li". A
partial cDNA clone encoding AM226_li was first isolated from a human fetal kidney cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTA/BLASTX and FASTA search protocols. The search revealed at least
some identity with ESTs reported by the I.M.A.G.E. Consortium identified as "yf09a01.r1
Homo sapiens cDNA clone 126312 5'" (R06469, BlastN) and "yy49b06.s1 Homo sapiens
cDNA clone 276851 3'" (N39415, BlastN). The sequence also showed some similarity with
bovine osteoinductive factor (OIF) (M37974, BlastN), with which it may share some activity.
The human cDNA clone corresponding to the EST database entry was ordered from Genome
Systems, Inc., St. Louis, Mo., a distributor of the I.M.A.G.E. Consortium library. The clone
received from the distributor was examined and determined to be a full length clone, including
a 5' end and 3' UTR (including a polyA tail). This full-length clone is also referred to herein
as "AM226_li".

Applicants' methods identified clone AM226_li as encoding a secreted protein.

The nucleotide sequence of the 5' portion of AM226_li as presently determined is
reported in SEQ ID NO:32. What applicants believe is the proper reading frame and the
predicted amino acid sequence of the AM226_li protein corresponding to the foregoing
nucleotide sequence is reported in SEQ ID NO:33. Amino acids 1 to 19 are the predicted
leader/signal sequence, with the predicted mature amino acid sequence beginning at amino acid
20. Additional nucleotide sequence from the 3' portion of AM226_li, including the polyA tail,
is reported in SEQ ID NO:34.

Protein "AR417_li"

One protein of the present invention has been identified as protein "AR417_li". A
partial cDNA clone encoding AR417_li was first isolated from a human adult retina cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTA/BLASTX and FASTA search protocols. The search revealed at least
some identity with ESTs reported by the I.M.A.G.E. Consortium identified at GenBank
accession numbers R18973, R42209 ("yf89g09.s1 Homo sapiens cDNA clone 29781 3'"),
R12416 ("yf56a02.r1 Homo sapiens cDNA clone 26106 5'") and R15309 ("yf89g09.r1 Homo
sapiens cDNA"). The human cDNA clone corresponding to the EST database entry was
ordered from Genome Systems, Inc., St. Louis, Mo., a distributor of the I.M.A.G.E. Consortium
library. The clone received from the distributor was examined and determined to be a full
length clone, including a 5' end and 3' UTR (including a polyA tail). This full-length clone is
also referred to herein as "AR417_li".

Applicants' methods identified clone AR417_li as encoding a secreted protein.

The nucleotide sequence of the 5' portion of AR417_li as presently determined is
reported in SEQ ID NO:35. What applicants believe is the proper reading frame and the
predicted amino acid sequence of the AR417_li protein corresponding to the foregoing
nucleotide sequence is reported in SEQ ID NO:36. Amino acids 1 to 24 are the predicted
leader/signal sequence, with the predicted mature amino acid sequence beginning at amino acid
25. Additional nucleotide sequence from the 3' portion of AR417_li, including the polyA tail,
is reported in SEQ ID NO:37.

Protein "AU43_li"

One protein of the present invention has been identified as protein "AU43_li". A
partial cDNA clone encoding AU43_li was first isolated from a human adult testes cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTA/BLASTX and FASTA search protocols. The search revealed at least
some identity with ESTs reported by the I.M.A.G.E. Consortium identified as "yi49f07.r1
Homo sapiens cDNA clone 142597 5'" (R70850, BlastN) and "yd68e02.s1 Homo sapiens
cDNA clone 113402 3'" (T78464, BlastN). The human cDNA clone corresponding to the EST
database entry was ordered from Genome Systems, Inc., St. Louis, Mo., a distributor of the
I.M.A.G.E. Consortium library. The clone received from the distributor was examined and
determined to be a full length clone, including a 5' end and 3' UTR (including a polyA tail).
This full-length clone is also referred to herein as "AU43_li".

Applicants' methods identified clone AU43_li as encoding a secreted protein.

The nucleotide sequence of the 5' portion of AU43_li as presently determined is
reported in SEQ ID NO:38. What applicants believe is the proper reading frame and the
predicted amino acid sequence of the AU43_li protein corresponding to the foregoing
nucleotide sequence is reported in SEQ ID NO:39. Amino acids 1 to 23 are the predicted
leader/signal sequence, with the predicted mature amino acid sequence beginning at amino acid
24. Additional nucleotide sequence from the 3' portion of AU43_li, including the polyA tail,
is reported in SEQ ID NO:40.

Protein "AW60_1i"

One protein of the present invention has been identified as protein "AW60_1i". A
partial cDNA clone encoding AW60_1i was first isolated from a human ovary (PA-1
teratocarcinoma) cDNA library using methods which are selective for cDNAs encoding
secreted proteins. The nucleotide sequence of such partial cDNA was determined and searched
against the GenBank database using BLASTN/BLASTX and FASTA search protocols. The
search revealed at least some identity with ESTs reported by the I.M.A.G.E. Consortium
identified as "ym57f11.r1 Homo sapiens cDNA clone 52343 5'" (H23492, BlastN),
"ym57f08.r1 Homo sapiens cDNA" (H23390, Fasta) and "ym57f11.s1 Homo sapiens cDNA
clone 52343 3'" (H23494, BlastN). The sequence also showed at least some identity with a
sequence identified as "Homo sapiens clone S31i125" (L40397, Fasta). The human cDNA clone
corresponding to the EST database entry was ordered from Genome Systems, Inc., St. Louis,
MO, a distributor of the I.M.A.G.E. Consortium library. The clone received from the
distributor was examined and determined to be a full-length clone, including a 5' end and 3'
UTR (including a polyA tail). This full-length clone is also referred to herein as "AW60_1i".

Applicants' methods identified clone AW60_1i as encoding a secreted protein.
The nucleotide sequence of the 5' portion of AW60_1i as presently determined is
reported in SEQ ID NO:41. What applicants believe is the proper reading frame and the
predicted amino acid sequence of the AW60_1i protein corresponding to the foregoing
nucleotide sequence is reported in SEQ ID NO:42. Amino acids 1 to 31 are the predicted
leader/signal sequence, with the predicted mature amino acid sequence beginning at amino acid
32. Additional nucleotide sequence from the 3' portion of AW60_1i, including the polyA tail,
is reported in SEQ ID NO:43.

Protein "BA176_1i"

One protein of the present invention has been identified as protein "BA176_1i". A
partial cDNA clone encoding BA176_1i was first isolated from a human adult placenta cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTN/BLASTX and FASTA search protocols. The search revealed at least
some identity with ESTs reported by the I.M.A.G.E. Consortium identified as "yi75g11.r1
Homo sapiens cDNA" (R77409, Fasta), "yj50b12.r1 Homo sapiens cDNA" (H03089, Fasta)
and "yi75g11.s1 Homo sapiens cDNA clone 145124 3'" (R77410, BlastN). The human cDNA
clone corresponding to the EST database entry was ordered from Genome Systems, Inc., St.
Louis, MO, a distributor of the I.M.A.G.E. Consortium library. The clone received from the
distributor was examined and determined to be a full-length clone, including a 5' end and 3'
UTR (including a polyA tail). This full-length clone is also referred to herein as "BA176_1i".

Applicants' methods identified clone BA176_1i as encoding a secreted protein.
The nucleotide sequence of the 5' portion of BA176_1i as presently determined is
reported in SEQ ID NO:44. What applicants believe is the proper reading frame and the
predicted amino acid sequence of the BA176_1i protein corresponding to the foregoing
nucleotide sequence is reported in SEQ ID NO:45. Amino acids 1 to 23 are the predicted
leader/signal sequence, with the predicted mature amino acid sequence beginning at amino acid
24. Additional nucleotide sequence from the 3' portion of BA176_1i, including the polyA tail,
is reported in SEQ ID NO:46.



Protein "BD140_1i"

One protein of the present invention has been identified as protein "BD140_1i". A
partial cDNA clone encoding BD140_1i was first isolated from a human fetal kidney cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTN/BLASTX and FASTA search protocols. The search revealed at least
some identity with ESTs reported by the I.M.A.G.E. Consortium identified as "yn98c02.r1
Homo sapiens cDNA" (H43507, Fasta), "yn67g04.r1 Homo sapiens cDNA" (H22693, Fasta)
and "yn82e07.s1 Homo sapiens cDNA clone 174948 3'" (H38408, BlastN). The human cDNA
clone corresponding to the EST database entry was ordered from Genome Systems, Inc., St.
Louis, MO, a distributor of the I.M.A.G.E. Consortium library. The clone received from the
distributor was examined and determined to be a full-length clone, including a 5' end and 3'
UTR (including a polyA tail). This full-length clone is also referred to herein as "BD140_1i".

Applicants' methods identified clone BD140_1i as encoding a secreted protein.
The nucleotide sequence of the 5' portion of BD140_1i as presently determined is
reported in SEQ ID NO:47. What applicants believe is the proper reading frame and the
predicted amino acid sequence of the BD140_1i protein corresponding to the foregoing
nucleotide sequence is reported in SEQ ID NO:48. Additional nucleotide sequence from the
3' portion of BD140_1i, including the polyA tail, is reported in SEQ ID NO:49.

Protein "BD407_1i"

One protein of the present invention has been identified as protein "BD407_1i". A
partial cDNA clone encoding BD407_1i was first isolated from a human fetal kidney cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTN/BLASTX and FASTA search protocols. The search revealed at least
some identity with ESTs reported by the I.M.A.G.E. Consortium identified as "ys65a05.r1
Homo sapiens cDNA" (H84524, Fasta) and "yz15h02.s1 Homo sapiens cDNA clone 283155
3'" (N51349, BlastN). The human cDNA clone corresponding to the EST database entry was
ordered from Genome Systems, Inc., St. Louis, MO, a distributor of the I.M.A.G.E. Consortium
library. The clone received from the distributor was examined and determined to be a
full-length clone, including a 5' end and 3' UTR (including a polyA tail). This full-length clone is
also referred to herein as "BD407_1i".

Applicants' methods identified clone BD407_1i as encoding a secreted protein.

The nucleotide sequence of BD407_1i as presently determined is reported in SEQ ID
NO:50. What applicants believe is the proper reading frame and the predicted amino acid
sequence of the BD407_1i protein corresponding to the foregoing nucleotide sequence is
reported in SEQ ID NO:51. Amino acids 1 to 14 are the predicted leader/signal sequence, with
the predicted mature amino acid sequence beginning at amino acid 15.



Protein "BF290_1i"

One protein of the present invention has been identified as protein "BF290_1i". A
partial cDNA clone encoding BF290_1i was first isolated from a human fetal brain cDNA
library using methods which are selective for cDNAs encoding secreted proteins. The
nucleotide sequence of such partial cDNA was determined and searched against the GenBank
database using BLASTN/BLASTX and FASTA search protocols. The search revealed at least
some identity with ESTs reported by the I.M.A.G.E. Consortium identified as "yh10f04.r1
Homo sapiens cDNA" (R61165, Fasta) and "yy35d12.s1 Homo sapiens cDNA clone 273239
3'" (N33175, BlastN). The human cDNA clone corresponding to the EST database entry was
ordered from Genome Systems, Inc., St. Louis, MO, a distributor of the I.M.A.G.E. Consortium
library. The clone received from the distributor was examined and determined to be a
full-length clone, including a 5' end and 3' UTR (including a polyA tail). This full-length clone is
also referred to herein as "BF290_1i".

Applicants' methods identified clone BF290_1i as encoding a secreted protein.
The nucleotide sequence of the 5' portion of BF290_1i as presently determined is
reported in SEQ ID NO:52. What applicants believe is the proper reading frame and the
predicted amino acid sequence of the BF290_1i protein corresponding to the foregoing
nucleotide sequence is reported in SEQ ID NO:53. Additional nucleotide sequence from the
3' portion of BF290_1i, including the polyA tail, is reported in SEQ ID NO:54.

Deposit of Clones

Clones AE402_1i, AE610_1i, AH106_1i, AH196_1i, AI6_1i, AJ13_1i, AJ27_1i,
AJ142_1i, AK604_1i, AK620_1i, AK650_1i, AM226_1i, AR417_1i, AU43_1i, AW60_1i,
BA176_1i, BD140_1i, BD407_1i and BF290_1i were deposited on October 2, 1996 with the
American Type Culture Collection under accession number ATCC 98190, from which each
clone comprising a particular polynucleotide is obtainable. Each clone has been transfected
into separate bacterial cells (E. coli) in this composite deposit.

Each clone can be removed from the vector in which it was deposited by performing
an EcoRI/NotI digestion (5' site, EcoRI; 3' site, NotI) to produce the appropriate fragment for
such clone. Each clone was deposited in either the pED6 or pNotS vector depicted in Fig. 1.
In some instances, the deposited clone can become "flipped" (i.e., in the reverse orientation)
in the deposited isolate. In such instances, the cDNA insert can still be isolated by digestion
with EcoRI and NotI. However, NotI will then produce the 5' site and EcoRI will produce the
3' site for placement of the cDNA in proper orientation for expression in a suitable vector. The
cDNA may also be expressed from the vectors in which they were deposited.
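
The orientation check described above can be illustrated with a minimal sketch. It assumes a hypothetical helper, describe_insert, that receives the cloning region written in the sense direction of the cDNA, and that each recognition sequence (GAATTC for EcoRI, GCGGCCGC for NotI) occurs exactly once in that region. It does not model the staggered cuts or overhangs left by the enzymes, and the example sequences are invented for illustration only.

```python
ECORI_SITE = "GAATTC"    # EcoRI recognition sequence
NOTI_SITE = "GCGGCCGC"   # NotI recognition sequence


def describe_insert(region: str) -> tuple[str, str]:
    """Return (orientation, insert) for a cloning region.

    Assumption: `region` is written 5'->3' in the sense direction of the
    cDNA and contains exactly one EcoRI site and one NotI site.
    The insert is reported as the bases lying between the two recognition
    sequences; enzyme overhangs are deliberately not modeled.
    """
    seq = region.upper()
    ecori = seq.find(ECORI_SITE)
    noti = seq.find(NOTI_SITE)
    if ecori == -1 or noti == -1:
        raise ValueError("region must contain both an EcoRI and a NotI site")
    if ecori < noti:
        # EcoRI at the 5' end, NotI at the 3' end: the expected arrangement.
        return "expected", seq[ecori + len(ECORI_SITE):noti]
    # NotI at the 5' end, EcoRI at the 3' end: the "flipped" arrangement.
    return "flipped", seq[noti + len(NOTI_SITE):ecori]


if __name__ == "__main__":
    # Toy sequences, not taken from any deposited clone.
    print(describe_insert("ccGAATTCatgaaataaGCGGCCGCtt"))
    print(describe_insert("ccGCGGCCGCatgaaataaGAATTCtt"))
```

In both toy cases the same insert is recovered; only the site that ends up at the 5' end differs, which is the point made in the paragraph above.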

Bacterial cells containing a particular clone can be obtained from the composite 
deposit as follows: 

An oligonucleotide probe or probes should be designed to the sequence that is known
for that particular clone. This sequence can be derived from the sequences provided herein,
or from a combination of those sequences.

In the sequences listed above which include an N at position 2, that position is
occupied in preferred probes/primers by a biotinylated phosphoramidite residue rather than
a nucleotide (such as, for example, that produced by use of biotin phosphoramidite
(1-dimethoxytrityloxy-2-(N-biotinyl-4-aminobutyl)-propyl-3-O-(2-cyanoethyl)-(N,N-diisopropyl)-phosphoramidite)
(Glen Research, cat. no. 10-1953)).

The design of the oligonucleotide probe should preferably follow these parameters:

(a) It should be designed to an area of the sequence which has the fewest
ambiguous bases ("N's"), if any;

(b) It should be designed to have a Tm of approx. 80°C (assuming 2° for each A
or T and 4° for each G or C).

The oligonucleotide should preferably be labeled with γ-32P ATP (specific activity 6000
Ci/mmole) and T4 polynucleotide kinase using commonly employed techniques for labeling
oligonucleotides. Other labeling techniques can also be used. Unincorporated label should
preferably be removed by gel filtration chromatography or other established methods. The
amount of radioactivity incorporated into the probe should be quantitated by measurement in




WO 98/14470 PCT/US97/18032

a scintillation counter. Preferably, specific activity of the resulting probe should be 
approximately 4e+6 dpm/pmole. 

The bacterial culture containing the pool of full-length clones should preferably be
thawed and 100 µl of the stock used to inoculate a sterile culture flask containing 25 ml of
sterile L-broth containing ampicillin at 100 µg/ml. The culture should preferably be grown to
saturation at 37°C, and the saturated culture should preferably be diluted in fresh L-broth.
Aliquots of these dilutions should preferably be plated to determine the dilution and volume
which will yield approximately 5000 distinct and well-separated colonies on solid
bacteriological media containing L-broth containing ampicillin at 100 µg/ml and agar at 1.5%
in a 150 mm petri dish when grown overnight at 37°C. Other known methods of obtaining
distinct, well-separated colonies can also be employed.

Standard colony hybridization procedures should then be used to transfer the colonies 
to nitrocellulose filters and lyse, denature and bake them. 

The filter is then preferably incubated at 65°C for 1 hour with gentle agitation in 6X
SSC (20X stock is 175.3 g NaCl/liter, 88.2 g Na citrate/liter, adjusted to pH 7.0 with NaOH)
containing 0.5% SDS, 100 µg/ml of yeast RNA, and 10 mM EDTA (approximately 10 mL per
150 mm filter). Preferably, the probe is then added to the hybridization mix at a concentration
greater than or equal to 1e+6 dpm/mL. The filter is then preferably incubated at 65°C with
gentle agitation overnight. The filter is then preferably washed in 500 mL of 2X SSC/0.5%
SDS at room temperature without agitation, preferably followed by 500 mL of 2X SSC/0.1%
SDS at room temperature with gentle shaking for 15 minutes. A third wash with 0.1X
SSC/0.5% SDS at 65°C for 30 minutes to 1 hour is optional. The filter is then preferably dried
and subjected to autoradiography for sufficient time to visualize the positives on the X-ray
film. Other known hybridization methods can also be employed.

The positive colonies are picked, grown in culture, and plasmid DNA is isolated using
standard procedures. The clones can then be verified by restriction analysis, hybridization
analysis, or DNA sequencing.

Fragments of the proteins of the present invention which are capable of exhibiting
biological activity are also encompassed by the present invention. Fragments of the protein
may be in linear form or they may be cyclized using known methods, for example, as described
in H.U. Saragovi, et al., Bio/Technology 10, 773-778 (1992) and in R.S. McDowell, et al., J.
Amer. Chem. Soc. 114, 9245-9253 (1992), both of which are incorporated herein by reference.
Such fragments may be fused to carrier molecules such as immunoglobulins for many
purposes, including increasing the valency of protein binding sites. For example, fragments
of the protein may be fused through "linker" sequences to the Fc portion of an
immunoglobulin. For a bivalent form of the protein, such a fusion could be to the Fc portion
of an IgG molecule. Other immunoglobulin isotypes may also be used to generate such
fusions. For example, a protein-IgM fusion would generate a decavalent form of the protein
of the invention.

The present invention also provides both full-length and mature forms of the disclosed
proteins. The full-length form of such proteins is identified in the sequence listing by
translation of the nucleotide sequence of each disclosed clone. The mature form of such
protein may be obtained by expression of the disclosed full-length polynucleotide (preferably
those deposited with ATCC) in a suitable mammalian cell or other host cell. The sequence of
the mature form of the protein may also be determinable from the amino acid sequence of the
full-length form.
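
Deriving the mature sequence from the full-length sequence, as described above, is a matter of removing the predicted leader/signal residues. The sketch below uses the predicted leader lengths stated for the individual clones elsewhere in this specification; the helper names and the toy amino acid sequence are hypothetical and do not correspond to any SEQ ID NO.

```python
# Predicted leader/signal lengths as stated in the clone descriptions.
PREDICTED_LEADER_LENGTH = {
    "AR417_1i": 24,
    "AU43_1i": 23,
    "AW60_1i": 31,
    "BA176_1i": 23,
    "BD407_1i": 14,
}


def split_leader(full_length: str, leader_length: int) -> tuple[str, str]:
    """Split a full-length protein into (leader, mature) sequences."""
    if not 0 < leader_length < len(full_length):
        raise ValueError("leader length must be shorter than the protein")
    return full_length[:leader_length], full_length[leader_length:]


def mature_start(leader_length: int) -> int:
    """1-based position of the first residue of the mature protein."""
    return leader_length + 1


if __name__ == "__main__":
    # Toy sequence (invented): a 14-residue leader followed by 6 residues.
    toy = "M" + "L" * 13 + "QPTEVK"
    n = PREDICTED_LEADER_LENGTH["BD407_1i"]
    leader, mature = split_leader(toy, n)
    print(mature_start(n), leader, mature)
```

The leader lengths are predictions; the mature form actually produced by a given host cell may differ and would be confirmed by expression, as the paragraph above notes.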

Where the protein of the present invention is membrane-bound (e.g., is a receptor), the 
present invention also provides for soluble forms of such protein. In such forms part or all of 
the intracellular and transmembrane domains of the protein are deleted such that the protein 
is fully secreted from the cell in which it is expressed. The intracellular and transmembrane
domains of proteins of the invention can be identified in accordance with known techniques 
for determination of such domains from sequence information. 

Species homologs of the disclosed proteins are also provided by the present invention. 
Species homologs may be isolated and identified by making suitable probes or primers from 
the sequences provided herein and screening a suitable nucleic acid source from the desired
species. 

The invention also encompasses allelic variants of the disclosed proteins; that is, 
naturally-occurring alternative forms of the isolated proteins which are identical, homologous 
or related to that encoded by the polynucleotides disclosed herein. 

The isolated polynucleotide encoding the protein of the invention may be operably
linked to an expression control sequence such as the pMT2 or pED expression vectors
disclosed in Kaufman et al., Nucleic Acids Res. 19, 4485-4490 (1991), in order to produce the
protein recombinantly. Many suitable expression control sequences are known in the art.
General methods of expressing recombinant proteins are also known and are exemplified in
R. Kaufman, Methods in Enzymology 185, 537-566 (1990). As defined herein "operably
linked" means that the isolated polynucleotide of the invention and an expression control
sequence are situated within a vector or cell in such a way that the protein is expressed by a
host cell which has been transformed (transfected) with the ligated polynucleotide/expression
control sequence.

A number of types of cells may act as suitable host cells for expression of the protein.
Mammalian host cells include, for example, monkey COS cells, Chinese Hamster Ovary
(CHO) cells, human kidney 293 cells, human epidermal A431 cells, human Colo205 cells, 3T3
cells, CV-1 cells, other transformed primate cell lines, normal diploid cells, cell strains derived
from in vitro culture of primary tissue, primary explants, HeLa cells, mouse L cells, BHK,
HL-60, U937, HaK or Jurkat cells.

Alternatively, it may be possible to produce the protein in lower eukaryotes such as
yeast or in prokaryotes such as bacteria. Potentially suitable yeast strains include
Saccharomyces cerevisiae, Schizosaccharomyces pombe, Kluyveromyces strains, Candida, or
any yeast strain capable of expressing heterologous proteins. Potentially suitable bacterial
strains include Escherichia coli, Bacillus subtilis, Salmonella typhimurium, or any bacterial
strain capable of expressing heterologous proteins. If the protein is made in yeast or bacteria,
it may be necessary to modify the protein produced therein, for example by phosphorylation
or glycosylation of the appropriate sites, in order to obtain the functional protein. Such
covalent attachments may be accomplished using known chemical or enzymatic methods.

The protein may also be produced by operably linking the isolated polynucleotide of
the invention to suitable control sequences in one or more insect expression vectors, and
employing an insect expression system. Materials and methods for baculovirus/insect cell
expression systems are commercially available in kit form from, e.g., Invitrogen, San Diego,
California, U.S.A. (the MaxBac® kit), and such methods are well known in the art, as
described in Summers and Smith, Texas Agricultural Experiment Station Bulletin No. 1555
(1987), incorporated herein by reference. As used herein, an insect cell capable of expressing
a polynucleotide of the present invention is "transformed."

The protein of the invention may be prepared by culturing transformed host cells under
culture conditions suitable to express the recombinant protein. The resulting expressed protein
may then be purified from such culture (i.e., from culture medium or cell extracts) using known
purification processes, such as gel filtration and ion exchange chromatography. The
purification of the protein may also include an affinity column containing agents which will
bind to the protein; one or more column steps over such affinity resins as concanavalin
A-agarose, heparin-toyopearl® or Cibacron blue 3GA Sepharose®; one or more steps involving
hydrophobic interaction chromatography using such resins as phenyl ether, butyl ether, or
propyl ether; or immunoaffinity chromatography.

Alternatively, the protein of the invention may also be expressed in a form which will
facilitate purification. For example, it may be expressed as a fusion protein, such as those of
maltose binding protein (MBP), glutathione-S-transferase (GST) or thioredoxin (TRX). Kits
for expression and purification of such fusion proteins are commercially available from New
England BioLabs (Beverly, MA), Pharmacia (Piscataway, NJ) and Invitrogen, respectively.
The protein can also be tagged with an epitope and subsequently purified by using a specific
antibody directed to such epitope. One such epitope ("Flag") is commercially available from
Kodak (New Haven, CT).

Finally, one or more reverse-phase high performance liquid chromatography (RP-HPLC)
steps employing hydrophobic RP-HPLC media, e.g., silica gel having pendant methyl
or other aliphatic groups, can be employed to further purify the protein. Some or all of the
foregoing purification steps, in various combinations, can also be employed to provide a
substantially homogeneous isolated recombinant protein. The protein thus purified is
substantially free of other mammalian proteins and is defined in accordance with the present
invention as an "isolated protein."

The protein of the invention may also be expressed as a product of transgenic animals,
e.g., as a component of the milk of transgenic cows, goats, pigs, or sheep which are
characterized by somatic or germ cells containing a nucleotide sequence encoding the protein.

The protein may also be produced by known conventional chemical synthesis.
Methods for constructing the proteins of the present invention by synthetic means are known
to those skilled in the art. The synthetically-constructed protein sequences, by virtue of sharing
primary, secondary or tertiary structural and/or conformational characteristics with proteins,
may possess biological properties in common therewith, including protein activity. Thus, they
may be employed as biologically active or immunological substitutes for natural, purified
proteins in screening of therapeutic compounds and in immunological processes for the
development of antibodies.

The proteins provided herein also include proteins characterized by amino acid
sequences similar to those of purified proteins but into which modifications are naturally
provided or deliberately engineered. For example, modifications in the peptide or DNA
sequences can be made by those skilled in the art using known techniques. Modifications of
interest in the protein sequences may include the alteration, substitution, replacement, insertion
or deletion of a selected amino acid residue in the coding sequence. For example, one or more
of the cysteine residues may be deleted or replaced with another amino acid to alter the
conformation of the molecule. Techniques for such alteration, substitution, replacement,
insertion or deletion are well known to those skilled in the art (see, e.g., U.S. Patent No.
4,518,584). Preferably, such alteration, substitution, replacement, insertion or deletion retains
the desired activity of the protein.

Other fragments and derivatives of the sequences of proteins which would be expected
to retain protein activity in whole or in part and may thus be useful for screening or other
immunological methodologies may also be easily made by those skilled in the art given the
disclosures herein. Such modifications are believed to be encompassed by the present
invention.



USES AND BIOLOGICAL ACTIVITY 

The proteins of the present invention are expected to exhibit one or more of the uses
or biological activities (including those associated with assays cited herein) identified below.
Uses or activities described for proteins of the present invention may be provided by
administration or use of such proteins or by administration or use of polynucleotides encoding
such proteins (such as, for example, in gene therapies or vectors suitable for introduction of
DNA).



Research Uses and Utilities

The proteins provided by the present invention can similarly be used in assays to
determine biological activity, including in a panel of multiple proteins for high-throughput
screening; to raise antibodies or to elicit another immune response; as a reagent (including the
labeled reagent) in assays designed to quantitatively determine levels of the protein (or its
receptor) in biological fluids; as markers for tissues in which the corresponding protein is
preferentially expressed (either constitutively or at a particular stage of tissue differentiation
or development or in a disease state); and, of course, to isolate correlative receptors or ligands.
Where the protein binds or potentially binds to another protein (such as, for example, in a
receptor-ligand interaction), the protein can be used to identify the other protein with which
binding occurs or to identify inhibitors of the binding interaction. Proteins involved in these
binding interactions can also be used to screen for peptide or small molecule inhibitors or
agonists of the binding interaction.

Any or all of these research utilities are capable of being developed into reagent grade 
or kit format for commercialization as research products. 

Methods for performing the uses listed above are well known to those skilled in the
art. References disclosing such methods include without limitation "Molecular Cloning: A
Laboratory Manual", 2d ed., Cold Spring Harbor Laboratory Press, Sambrook, J., E.F. Fritsch
and T. Maniatis eds., 1989, and "Methods in Enzymology: Guide to Molecular Cloning
Techniques", Academic Press, Berger, S.L. and A.R. Kimmel eds., 1987.

Nutritional Uses

Proteins of the present invention can also be used as nutritional sources or 
supplements. Such uses include without limitation use as a protein or amino acid supplement, 
use as a carbon source, use as a nitrogen source and use as a source of carbohydrate. In such 
cases the protein of the invention can be added to the feed of a particular organism or can be
administered as a separate solid or liquid preparation, such as in the form of powder, pills, 
solutions, suspensions or capsules. In the case of microorganisms, the protein of the invention 
can be added to the medium in or on which the microorganism is cultured. 

Cytokine and Cell Proliferation/Differentiation Activity

A protein of the present invention may exhibit cytokine, cell proliferation (either
inducing or inhibiting) or cell differentiation (either inducing or inhibiting) activity or may
induce production of other cytokines in certain cell populations. Many protein factors
discovered to date, including all known cytokines, have exhibited activity in one or more
factor-dependent cell proliferation assays, and hence the assays serve as a convenient confirmation
of cytokine activity. The activity of a protein of the present invention is evidenced by any one
of a number of routine factor-dependent cell proliferation assays for cell lines including,
without limitation, 32D, DA2, DA1G, T10, B9, B9/11, BaF3, MC9/G, M+ (preB M+), 2E8,
RB5, DA1, 123, T1165, HT2, CTLL2, TF-1, Mo7e and CMK.

The activity of a protein of the invention may, among other means, be measured by the
following methods:

Assays for T-cell or thymocyte proliferation include without limitation those described
in: Current Protocols in Immunology, Ed by J. E. Coligan, A.M. Kruisbeek, D.H. Margulies,
E.M. Shevach, W Strober, Pub. Greene Publishing Associates and Wiley-Interscience (Chapter
3, In Vitro assays for Mouse Lymphocyte Function 3.1-3.19; Chapter 7, Immunologic studies
in Humans); Takai et al., J. Immunol. 137:3494-3500, 1986; Bertagnolli et al., J. Immunol.
145:1706-1712, 1990; Bertagnolli et al., Cellular Immunology 133:327-341, 1991;
Bertagnolli, et al., J. Immunol. 149:3778-3783, 1992; Bowman et al., J. Immunol.
152:1756-1761, 1994.

Assays for cytokine production and/or proliferation of spleen cells, lymph node cells
or thymocytes include, without limitation, those described in: Polyclonal T cell stimulation,
Kruisbeek, A.M. and Shevach, E.M. In Current Protocols in Immunology. J.E.e.a. Coligan
eds. Vol 1 pp. 3.12.1-3.12.14, John Wiley and Sons, Toronto, 1994; and Measurement of
mouse and human Interferon γ, Schreiber, R.D. In Current Protocols in Immunology. J.E.e.a.
Coligan eds. Vol 1 pp. 6.8.1-6.8.8, John Wiley and Sons, Toronto, 1994.

Assays for proliferation and differentiation of hematopoietic and lymphopoietic cells
include, without limitation, those described in: Measurement of Human and Murine Interleukin
2 and Interleukin 4, Bottomly, K., Davis, L.S. and Lipsky, P.E. In Current Protocols in
Immunology. J.E.e.a. Coligan eds. Vol 1 pp. 6.3.1-6.3.12, John Wiley and Sons, Toronto,
1991; deVries et al., J. Exp. Med. 173:1205-1211, 1991; Moreau et al., Nature 336:690-692,
1988; Greenberger et al., Proc. Natl. Acad. Sci. U.S.A. 80:2931-2938, 1983; Measurement of
mouse and human interleukin 6 - Nordan, R. In Current Protocols in Immunology. J.E.e.a.
Coligan eds. Vol 1 pp. 6.6.1-6.6.5, John Wiley and Sons, Toronto, 1991; Smith et al., Proc.
Natl. Acad. Sci. U.S.A. 83:1857-1861, 1986; Measurement of human Interleukin 11 - Bennett,
F., Giannotti, J., Clark, S.C. and Turner, K.J. In Current Protocols in Immunology. J.E.e.a.
Coligan eds. Vol 1 pp. 6.15.1, John Wiley and Sons, Toronto, 1991; Measurement of mouse
and human Interleukin 9 - Ciarletta, A., Giannotti, J., Clark, S.C. and Turner, K.J. In Current
Protocols in Immunology. J.E.e.a. Coligan eds. Vol 1 pp. 6.13.1, John Wiley and Sons,
Toronto, 1991.

Assays for T-cell clone responses to antigens (which will identify, among others,
proteins that affect APC-T cell interactions as well as direct T-cell effects by measuring
proliferation and cytokine production) include, without limitation, those described in: Current
Protocols in Immunology, Ed by J. E. Coligan, A.M. Kruisbeek, D.H. Margulies, E.M.
Shevach, W Strober, Pub. Greene Publishing Associates and Wiley-Interscience (Chapter 3,
In Vitro assays for Mouse Lymphocyte Function; Chapter 6, Cytokines and their cellular
receptors; Chapter 7, Immunologic studies in Humans); Weinberger et al., Proc. Natl. Acad.
Sci. USA 77:6091-6095, 1980; Weinberger et al., Eur. J. Immun. 11:405-411, 1981; Takai
et al., J. Immunol. 137:3494-3500, 1986; Takai et al., J. Immunol. 140:508-512, 1988.

Immune Stimulating or Suppressing Activity

A protein of the present invention may also exhibit immune stimulating or immune
suppressing activity, including without limitation the activities for which assays are described
herein. A protein may be useful in the treatment of various immune deficiencies and disorders
(including severe combined immunodeficiency (SCID)), e.g., in regulating (up or down)
growth and proliferation of T and/or B lymphocytes, as well as affecting the cytolytic activity
of NK cells and other cell populations. These immune deficiencies may be genetic or be
caused by viral (e.g., HIV) as well as bacterial or fungal infections, or may result from
autoimmune disorders. More specifically, infectious diseases caused by viral, bacterial, fungal
or other infection may be treatable using a protein of the present invention, including
infections by HIV, hepatitis viruses, herpesviruses, mycobacteria, Leishmania spp., malaria
spp. and various fungal infections such as candidiasis. Of course, in this regard, a protein of
the present invention may also be useful where a boost to the immune system generally may
be desirable, i.e., in the treatment of cancer.

Autoimmune disorders which may be treated using a protein of the present invention
include, for example, connective tissue disease, multiple sclerosis, systemic lupus
erythematosus, rheumatoid arthritis, autoimmune pulmonary inflammation, Guillain-Barré
syndrome, autoimmune thyroiditis, insulin dependent diabetes mellitus, myasthenia gravis,
graft-versus-host disease and autoimmune inflammatory eye disease. Such a protein of the
present invention may also be useful in the treatment of allergic reactions and conditions,
such as asthma (particularly allergic asthma) or other respiratory problems. Other conditions
in which immune suppression is desired (including, for example, organ transplantation) may
also be treatable using a protein of the present invention.

Using the proteins of the invention it may also be possible to regulate immune responses in a
number of ways. Down regulation may be in the form of inhibiting or blocking an immune
response already in progress or may involve preventing the induction of an immune response.
The functions of activated T cells may be inhibited by suppressing T cell responses or by
inducing specific tolerance in T cells, or both. Immunosuppression of T cell responses is
generally an active, non-antigen-specific process which requires continuous exposure of the
T cells to the suppressive agent. Tolerance, which involves inducing non-responsiveness or
anergy in T cells, is distinguishable from immunosuppression in that it is generally
antigen-specific and persists after exposure to the tolerizing agent has ceased. Operationally,
tolerance can be demonstrated by the lack of a T cell response upon reexposure to specific
antigen in the absence of the tolerizing agent.

Down regulating or preventing one or more antigen functions (including without
limitation B lymphocyte antigen functions (such as, for example, B7)), e.g., preventing high
level lymphokine synthesis by activated T cells, will be useful in situations of tissue, skin and
organ transplantation and in graft-versus-host disease (GVHD). For example, blockage of T
cell function should result in reduced tissue destruction in tissue transplantation. Typically,
in tissue transplants, rejection of the transplant is initiated through its recognition as foreign
by T cells, followed by an immune reaction that destroys the transplant. The administration
of a molecule which inhibits or blocks interaction of a B7 lymphocyte antigen with its natural
ligand(s) on immune cells (such as a soluble, monomeric form of a peptide having B7-2
activity alone or in conjunction with a monomeric form of a peptide having an activity of
another B lymphocyte antigen (e.g., B7-1, B7-3) or blocking antibody), prior to transplantation
can lead to the binding of the molecule to the natural ligand(s) on the immune cells without
transmitting the corresponding costimulatory signal. Blocking B lymphocyte antigen function
in this manner prevents cytokine synthesis by immune cells, such as T cells, and thus acts as an
immunosuppressant. Moreover, the lack of costimulation may also be sufficient to anergize
the T cells, thereby inducing tolerance in a subject. Induction of long-term tolerance by B
lymphocyte antigen-blocking reagents may avoid the necessity of repeated administration of
these blocking reagents. To achieve sufficient immunosuppression or tolerance in a subject,
it may also be necessary to block the function of a combination of B lymphocyte antigens.

The efficacy of particular blocking reagents in preventing organ transplant rejection
or GVHD can be assessed using animal models that are predictive of efficacy in humans.
Examples of appropriate systems which can be used include allogeneic cardiac grafts in rats
and xenogeneic pancreatic islet cell grafts in mice, both of which have been used to examine
the immunosuppressive effects of CTLA4Ig fusion proteins in vivo as described in Lenschow
et al., Science 257:789-792 (1992) and Turka et al., Proc. Natl. Acad. Sci. USA 89:11102-
11105 (1992). In addition, murine models of GVHD (see Paul ed., Fundamental Immunology,
Raven Press, New York, 1989, pp. 846-847) can be used to determine the effect of blocking
B lymphocyte antigen function in vivo on the development of that disease.

Blocking antigen function may also be therapeutically useful for treating autoimmune
diseases. Many autoimmune disorders are the result of inappropriate activation of T cells that
are reactive against self tissue and which promote the production of cytokines and
autoantibodies involved in the pathology of the diseases. Preventing the activation of
autoreactive T cells may reduce or eliminate disease symptoms. Administration of reagents
which block costimulation of T cells by disrupting receptor:ligand interactions of B
lymphocyte antigens can be used to inhibit T cell activation and prevent production of
autoantibodies or T cell-derived cytokines which may be involved in the disease process.
Additionally, blocking reagents may induce antigen-specific tolerance of autoreactive T cells
which could lead to long-term relief from the disease. The efficacy of blocking reagents in
preventing or alleviating autoimmune disorders can be determined using a number of well-
characterized animal models of human autoimmune diseases. Examples include murine
experimental autoimmune encephalitis, systemic lupus erythematosus in MRL/lpr/lpr mice or
NZB hybrid mice, murine autoimmune collagen arthritis, diabetes mellitus in NOD mice and
BB rats, and murine experimental myasthenia gravis (see Paul ed., Fundamental Immunology,
Raven Press, New York, 1989, pp. 840-856).

Upregulation of an antigen function (preferably a B lymphocyte antigen function), as
a means of up regulating immune responses, may also be useful in therapy. Upregulation of
immune responses may be in the form of enhancing an existing immune response or eliciting
an initial immune response. For example, enhancing an immune response through stimulating
B lymphocyte antigen function may be useful in cases of viral infection. In addition, systemic
viral diseases such as influenza, the common cold, and encephalitis might be alleviated by the
administration of stimulatory forms of B lymphocyte antigens systemically.

Alternatively, anti-viral immune responses may be enhanced in an infected patient by
removing T cells from the patient, costimulating the T cells in vitro with viral antigen-pulsed
APCs either expressing a peptide of the present invention or together with a stimulatory form
of a soluble peptide of the present invention and reintroducing the in vitro activated T cells into
the patient. Another method of enhancing anti-viral immune responses would be to isolate
infected cells from a patient, transfect them with a nucleic acid encoding a protein of the
present invention as described herein such that the cells express all or a portion of the protein
on their surface, and reintroduce the transfected cells into the patient. The infected cells would
now be capable of delivering a costimulatory signal to, and thereby activate, T cells in vivo.

In another application, up regulation or enhancement of antigen function (preferably
B lymphocyte antigen function) may be useful in the induction of tumor immunity. Tumor
cells (e.g., sarcoma, melanoma, lymphoma, leukemia, neuroblastoma, carcinoma) transfected
with a nucleic acid encoding at least one peptide of the present invention can be administered
to a subject to overcome tumor-specific tolerance in the subject. If desired, the tumor cell can
be transfected to express a combination of peptides. For example, tumor cells obtained from
a patient can be transfected ex vivo with an expression vector directing the expression of a
peptide having B7-2-like activity alone, or in conjunction with a peptide having B7-1-like
activity and/or B7-3-like activity. The transfected tumor cells are returned to the patient to
result in expression of the peptides on the surface of the transfected cell. Alternatively, gene
therapy techniques can be used to target a tumor cell for transfection in vivo.

The presence of the peptide of the present invention having the activity of a B
lymphocyte antigen(s) on the surface of the tumor cell provides the necessary costimulation
signal to T cells to induce a T cell mediated immune response against the transfected tumor
cells. In addition, tumor cells which lack MHC class I or MHC class II molecules, or which
fail to reexpress sufficient amounts of MHC class I or MHC class II molecules, can be
transfected with nucleic acid encoding all or a portion (e.g., a cytoplasmic-domain truncated
portion) of an MHC class I α chain protein and β2 microglobulin protein or an MHC class II
α chain protein and an MHC class II β chain protein to thereby express MHC class I or MHC
class II proteins on the cell surface. Expression of the appropriate class I or class II MHC in
conjunction with a peptide having the activity of a B lymphocyte antigen (e.g., B7-1, B7-2,
B7-3) induces a T cell mediated immune response against the transfected tumor cell. Optionally,
a gene encoding an antisense construct which blocks expression of an MHC class II associated
protein, such as the invariant chain, can also be cotransfected with a DNA encoding a peptide
having the activity of a B lymphocyte antigen to promote presentation of tumor associated
antigens and induce tumor specific immunity. Thus, the induction of a T cell mediated
immune response in a human subject may be sufficient to overcome tumor-specific tolerance
in the subject.

The activity of a protein of the invention may, among other means, be measured by the 
following methods: 

Suitable assays for thymocyte or splenocyte cytotoxicity include, without limitation,
those described in: Current Protocols in Immunology, Ed by J.E. Coligan, A.M. Kruisbeek,
D.H. Margulies, E.M. Shevach, W. Strober, Pub. Greene Publishing Associates and Wiley-
Interscience (Chapter 3, In Vitro assays for Mouse Lymphocyte Function 3.1-3.19; Chapter 7,
Immunologic studies in Humans); Herrmann et al., Proc. Natl. Acad. Sci. USA 78:2488-2492,
1981; Herrmann et al., J. Immunol. 128:1968-1974, 1982; Handa et al., J. Immunol.
135:1564-1572, 1985; Takai et al., J. Immunol. 137:3494-3500, 1986; Takai et al., J.
Immunol. 140:508-512, 1988; Bowman et al., J. Virology 61:1992-1998; Bertagnolli et al.,
Cellular Immunology 133:327-341, 1991; Brown et al., J. Immunol. 153:3079-3092, 1994.

Assays for T-cell-dependent immunoglobulin responses and isotype switching (which
will identify, among others, proteins that modulate T-cell dependent antibody responses and
that affect Th1/Th2 profiles) include, without limitation, those described in: Maliszewski, J.
Immunol. 144:3028-3033, 1990; and Assays for B cell function: In vitro antibody production,
Mond, J.J. and Brunswick, M. In Current Protocols in Immunology. J.E.e.a. Coligan eds. Vol
1 pp. 3.8.1-3.8.16, John Wiley and Sons, Toronto, 1994.

Mixed lymphocyte reaction (MLR) assays (which will identify, among others, proteins
that generate predominantly Th1 and CTL responses) include, without limitation, those
described in: Current Protocols in Immunology, Ed by J.E. Coligan, A.M. Kruisbeek, D.H.
Margulies, E.M. Shevach, W. Strober, Pub. Greene Publishing Associates and Wiley-
Interscience (Chapter 3, In Vitro assays for Mouse Lymphocyte Function 3.1-3.19; Chapter 7,
Immunologic studies in Humans); Takai et al., J. Immunol. 137:3494-3500, 1986; Takai et al.,
J. Immunol. 140:508-512, 1988; Bertagnolli et al., J. Immunol. 149:3778-3783, 1992.

Dendritic cell-dependent assays (which will identify, among others, proteins expressed
by dendritic cells that activate naive T-cells) include, without limitation, those described in:
Guery et al., J. Immunol. 134:536-544, 1995; Inaba et al., Journal of Experimental Medicine
173:549-559, 1991; Macatonia et al., Journal of Immunology 154:5071-5079, 1995; Porgador
et al., Journal of Experimental Medicine 182:255-260, 1995; Nair et al., Journal of Virology
67:4062-4069, 1993; Huang et al., Science 264:961-965, 1994; Macatonia et al., Journal of
Experimental Medicine 169:1255-1264, 1989; Bhardwaj et al., Journal of Clinical
Investigation 94:797-807, 1994; and Inaba et al., Journal of Experimental Medicine 172:631-
640, 1990.

Assays for lymphocyte survival/apoptosis (which will identify, among others, proteins
that prevent apoptosis after superantigen induction and proteins that regulate lymphocyte
homeostasis) include, without limitation, those described in: Darzynkiewicz et al., Cytometry
13:795-808, 1992; Gorczyca et al., Leukemia 7:659-670, 1993; Gorczyca et al., Cancer
Research 53:1945-1951, 1993; Itoh et al., Cell 66:233-243, 1991; Zacharchuk, Journal of
Immunology 145:4037-4045, 1990; Zamai et al., Cytometry 14:891-897, 1993; Gorczyca et
al., International Journal of Oncology 1:639-648, 1992.

Assays for proteins that influence early steps of T-cell commitment and development
include, without limitation, those described in: Antica et al., Blood 84:111-117, 1994; Fine
et al., Cellular Immunology 155:111-122, 1994; Galy et al., Blood 85:2770-2778, 1995; Toki
et al., Proc. Natl. Acad. Sci. USA 88:7548-7551, 1991.

Hematopoiesis Regulating Activity

A protein of the present invention may be useful in regulation of hematopoiesis and,
consequently, in the treatment of myeloid or lymphoid cell deficiencies. Even marginal
biological activity in support of colony forming cells or of factor-dependent cell lines indicates
involvement in regulating hematopoiesis, e.g. in supporting the growth and proliferation of
erythroid progenitor cells alone or in combination with other cytokines, thereby indicating
utility, for example, in treating various anemias or for use in conjunction with
irradiation/chemotherapy to stimulate the production of erythroid precursors and/or erythroid
cells; in supporting the growth and proliferation of myeloid cells such as granulocytes and
monocytes/macrophages (i.e., traditional CSF activity) useful, for example, in conjunction with
chemotherapy to prevent or treat consequent myelo-suppression; in supporting the growth and
proliferation of megakaryocytes and consequently of platelets thereby allowing prevention or
treatment of various platelet disorders such as thrombocytopenia, and generally for use in place
of or complementary to platelet transfusions; and/or in supporting the growth and proliferation
of hematopoietic stem cells which are capable of maturing to any and all of the above-
mentioned hematopoietic cells and therefore find therapeutic utility in various stem cell
disorders (such as those usually treated with transplantation, including, without limitation,
aplastic anemia and paroxysmal nocturnal hemoglobinuria), as well as in repopulating the stem
cell compartment post irradiation/chemotherapy, either in-vivo or ex-vivo (i.e., in conjunction
with bone marrow transplantation or with peripheral progenitor cell transplantation
(homologous or heterologous)) as normal cells or genetically manipulated for gene therapy.

The activity of a protein of the invention may, among other means, be measured by the 
following methods: 

Suitable assays for proliferation and differentiation of various hematopoietic lines are 
cited above. 

Assays for embryonic stem cell differentiation (which will identify, among others,
proteins that influence embryonic differentiation hematopoiesis) include, without limitation,
those described in: Johansson et al., Cellular Biology 15:141-151, 1995; Keller et al., Molecular
and Cellular Biology 13:473-486, 1993; McClanahan et al., Blood 81:2903-2915, 1993.

Assays for stem cell survival and differentiation (which will identify, among others,
proteins that regulate lympho-hematopoiesis) include, without limitation, those described in:
Methylcellulose colony forming assays, Freshney, M.G. In Culture of Hematopoietic Cells. R.I.
Freshney, et al. eds. Vol pp. 265-268, Wiley-Liss, Inc., New York, NY, 1994; Hirayama et
al., Proc. Natl. Acad. Sci. USA 89:5907-5911, 1992; Primitive hematopoietic colony forming
cells with high proliferative potential, McNiece, I.K. and Briddell, R.A. In Culture of
Hematopoietic Cells. R.I. Freshney, et al. eds. Vol pp. 23-39, Wiley-Liss, Inc., New York,
NY, 1994; Neben et al., Experimental Hematology 22:353-359, 1994; Cobblestone area
forming cell assay, Ploemacher, R.E. In Culture of Hematopoietic Cells. R.I. Freshney, et al.
eds. Vol pp. 1-21, Wiley-Liss, Inc., New York, NY, 1994; Long term bone marrow cultures
in the presence of stromal cells, Spooncer, E., Dexter, M. and Allen, T. In Culture of
Hematopoietic Cells. R.I. Freshney, et al. eds. Vol pp. 163-179, Wiley-Liss, Inc., New York,
NY, 1994; Long term culture initiating cell assay, Sutherland, H.J. In Culture of Hematopoietic
Cells. R.I. Freshney, et al. eds. Vol pp. 139-162, Wiley-Liss, Inc., New York, NY, 1994.

Tissue Growth Activity 

A protein of the present invention also may have utility in compositions used for bone,
cartilage, tendon, ligament and/or nerve tissue growth or regeneration, as well as for wound
healing and tissue repair and replacement, and in the treatment of burns, incisions and ulcers.

A protein of the present invention, which induces cartilage and/or bone growth in
circumstances where bone is not normally formed, has application in the healing of bone
fractures and cartilage damage or defects in humans and other animals. Such a preparation
employing a protein of the invention may have prophylactic use in closed as well as open
fracture reduction and also in the improved fixation of artificial joints. De novo bone
formation induced by an osteogenic agent contributes to the repair of congenital, trauma
induced, or oncologic resection induced craniofacial defects, and also is useful in cosmetic
plastic surgery.

A protein of this invention may also be used in the treatment of periodontal disease,
and in other tooth repair processes. Such agents may provide an environment to attract bone-
forming cells, stimulate growth of bone-forming cells or induce differentiation of progenitors
of bone-forming cells. A protein of the invention may also be useful in the treatment of
osteoporosis or osteoarthritis, such as through stimulation of bone and/or cartilage repair or by
blocking inflammation or processes of tissue destruction (collagenase activity, osteoclast
activity, etc.) mediated by inflammatory processes.

Another category of tissue regeneration activity that may be attributable to the protein
of the present invention is tendon/ligament formation. A protein of the present invention,
which induces tendon/ligament-like tissue or other tissue formation in circumstances where
such tissue is not normally formed, has application in the healing of tendon or ligament tears,
deformities and other tendon or ligament defects in humans and other animals. Such a
preparation employing a tendon/ligament-like tissue inducing protein may have prophylactic
use in preventing damage to tendon or ligament tissue, as well as use in the improved fixation
of tendon or ligament to bone or other tissues, and in repairing defects to tendon or ligament
tissue. De novo tendon/ligament-like tissue formation induced by a composition of the present
invention contributes to the repair of congenital, trauma induced, or other tendon or ligament
defects of other origin, and is also useful in cosmetic plastic surgery for attachment or repair
of tendons or ligaments. The compositions of the present invention may provide an
environment to attract tendon- or ligament-forming cells, stimulate growth of tendon- or
ligament-forming cells, induce differentiation of progenitors of tendon- or ligament-forming
cells, or induce growth of tendon/ligament cells or progenitors ex vivo for return in vivo to
effect tissue repair. The compositions of the invention may also be useful in the treatment of
tendinitis, carpal tunnel syndrome and other tendon or ligament defects. The compositions
may also include an appropriate matrix and/or sequestering agent as a carrier as is well known
in the art.

The protein of the present invention may also be useful for proliferation of neural cells
and for regeneration of nerve and brain tissue, i.e. for the treatment of central and peripheral
nervous system diseases and neuropathies, as well as mechanical and traumatic disorders,
which involve degeneration, death or trauma to neural cells or nerve tissue. More specifically,
a protein may be used in the treatment of diseases of the peripheral nervous system, such as
peripheral nerve injuries, peripheral neuropathy and localized neuropathies, and central
nervous system diseases, such as Alzheimer's, Parkinson's disease, Huntington's disease,
amyotrophic lateral sclerosis, and Shy-Drager syndrome. Further conditions which may be
treated in accordance with the present invention include mechanical and traumatic disorders,
such as spinal cord disorders, head trauma and cerebrovascular diseases such as stroke.
Peripheral neuropathies resulting from chemotherapy or other medical therapies may also be
treatable using a protein of the invention.

Proteins of the invention may also be useful to promote better or faster closure of
non-healing wounds, including without limitation pressure ulcers, ulcers associated with vascular
insufficiency, surgical and traumatic wounds, and the like.

It is expected that a protein of the present invention may also exhibit activity for
generation or regeneration of other tissues, such as organs (including, for example, pancreas,
liver, intestine, kidney, skin, endothelium), muscle (smooth, skeletal or cardiac) and vascular
(including vascular endothelium) tissue, or for promoting the growth of cells comprising such
tissues. Part of the desired effects may be by inhibition or modulation of fibrotic scarring to
allow normal tissue to regenerate. A protein of the invention may also exhibit angiogenic
activity.

A protein of the present invention may also be useful for gut protection or regeneration
and treatment of lung or liver fibrosis, reperfusion injury in various tissues, and conditions
resulting from systemic cytokine damage.

A protein of the present invention may also be useful for promoting or inhibiting
differentiation of tissues described above from precursor tissues or cells; or for inhibiting the
growth of tissues described above.

The activity of a protein of the invention may, among other means, be measured by the
following methods:

Assays for tissue generation activity include, without limitation, those described in:
International Patent Publication No. WO95/16035 (bone, cartilage, tendon); International
Patent Publication No. WO95/05846 (nerve, neuronal); International Patent Publication No.
WO91/07491 (skin, endothelium).

Assays for wound healing activity include, without limitation, those described in:
Winter, Epidermal Wound Healing, pps. 71-112 (Maibach, HI and Rovee, DT, eds.), Year
Book Medical Publishers, Inc., Chicago, as modified by Eaglstein and Mertz, J. Invest.
Dermatol. 71:382-84 (1978).

Activin/Inhibin Activity

A protein of the present invention may also exhibit activin- or inhibin-related
activities. Inhibins are characterized by their ability to inhibit the release of follicle stimulating
hormone (FSH), while activins are characterized by their ability to stimulate the release
of follicle stimulating hormone (FSH). Thus, a protein of the present invention, alone or in
heterodimers with a member of the inhibin α family, may be useful as a contraceptive based
on the ability of inhibins to decrease fertility in female mammals and decrease spermatogenesis
in male mammals. Administration of sufficient amounts of other inhibins can induce infertility
in these mammals. Alternatively, the protein of the invention, as a homodimer or as a
heterodimer with other protein subunits of the inhibin-β group, may be useful as a fertility
inducing therapeutic, based upon the ability of activin molecules to stimulate FSH release
from cells of the anterior pituitary. See, for example, United States Patent 4,798,885. A
protein of the invention may also be useful for advancement of the onset of fertility in sexually
immature mammals, so as to increase the lifetime reproductive performance of domestic
animals such as cows, sheep and pigs.

The activity of a protein of the invention may, among other means, be measured by the 
following methods: 

Assays for activin/inhibin activity include, without limitation, those described in: Vale
et al., Endocrinology 91:562-572, 1972; Ling et al., Nature 321:779-782, 1986; Vale et al.,
Nature 321:776-779, 1986; Mason et al., Nature 318:659-663, 1985; Forage et al., Proc. Natl.
Acad. Sci. USA 83:3091-3095, 1986.

Chemotactic/Chemokinetic Activity

A protein of the present invention may have chemotactic or chemokinetic activity (e.g.,
act as a chemokine) for mammalian cells, including, for example, monocytes, fibroblasts,
neutrophils, T-cells, mast cells, eosinophils, epithelial and/or endothelial cells. Chemotactic
and chemokinetic proteins can be used to mobilize or attract a desired cell population to a
desired site of action. Chemotactic or chemokinetic proteins provide particular advantages in
treatment of wounds and other trauma to tissues, as well as in treatment of localized infections.
For example, attraction of lymphocytes, monocytes or neutrophils to tumors or sites of
infection may result in improved immune responses against the tumor or infecting agent.

A protein or peptide has chemotactic activity for a particular cell population if it can
stimulate, directly or indirectly, the directed orientation or movement of such cell population.
Preferably, the protein or peptide has the ability to directly stimulate directed movement of
cells. Whether a particular protein has chemotactic activity for a population of cells can be
readily determined by employing such protein or peptide in any known assay for cell
chemotaxis.

The activity of a protein of the invention may, among other means, be measured by the
following methods: 

Assays for chemotactic activity (which will identify proteins that induce or prevent
chemotaxis) consist of assays that measure the ability of a protein to induce the migration of
cells across a membrane as well as the ability of a protein to induce the adhesion of one cell
population to another cell population. Suitable assays for movement and adhesion include,
without limitation, those described in: Current Protocols in Immunology, Ed by J.E. Coligan,
A.M. Kruisbeek, D.H. Margulies, E.M. Shevach, W. Strober, Pub. Greene Publishing
Associates and Wiley-Interscience (Chapter 6.12, Measurement of alpha and beta Chemokines
6.12.1-6.12.28); Taub et al., J. Clin. Invest. 95:1370-1376, 1995; Lind et al., APMIS
103:140-146, 1995; Muller et al., Eur. J. Immunol. 25:1744-1748; Gruber et al., J. of Immunol.
152:5860-5867, 1994; Johnston et al., J. of Immunol. 153:1762-1768, 1994.

Hemostatic and Thrombolytic Activity 

A protein of the invention may also exhibit hemostatic or thrombolytic activity. As
a result, such a protein is expected to be useful in treatment of various coagulation disorders
(including hereditary disorders, such as hemophilias) or to enhance coagulation and other
hemostatic events in treating wounds resulting from trauma, surgery or other causes. A protein
of the invention may also be useful for dissolving or inhibiting formation of thromboses and
for treatment and prevention of conditions resulting therefrom (such as, for example, infarction
of cardiac and central nervous system vessels (e.g., stroke)).

The activity of a protein of the invention may, among other means, be measured by the
following methods:

Assays for hemostatic and thrombolytic activity include, without limitation, those
described in: Linet et al., J. Clin. Pharmacol. 26:131-140, 1986; Burdick et al., Thrombosis
Res. 45:413-419, 1987; Humphrey et al., Fibrinolysis 5:71-79 (1991); Schaub, Prostaglandins
35:467-474, 1988.

Receptor/Ligand Activity

A protein of the present invention may also demonstrate activity as receptors, receptor
ligands or inhibitors or agonists of receptor/ligand interactions. Examples of such receptors
and ligands include, without limitation, cytokine receptors and their ligands, receptor kinases
and their ligands, receptor phosphatases and their ligands, receptors involved in cell-cell
interactions and their ligands (including without limitation, cellular adhesion molecules (such
as selectins, integrins and their ligands) and receptor/ligand pairs involved in antigen
presentation, antigen recognition and development of cellular and humoral immune responses).
Receptors and ligands are also useful for screening of potential peptide or small molecule
inhibitors of the relevant receptor/ligand interaction. A protein of the present invention
(including, without limitation, fragments of receptors and ligands) may itself be useful
as an inhibitor of receptor/ligand interactions.

The activity of a protein of the invention may, among other means, be measured by the 
following methods: 

Suitable assays for receptor-ligand activity include without limitation those described
in: Current Protocols in Immunology, Ed by J.E. Coligan, A.M. Kruisbeek, D.H. Margulies,
E.M. Shevach, W. Strober, Pub. Greene Publishing Associates and Wiley-Interscience
(Chapter 7.28, Measurement of Cellular Adhesion under static conditions 7.28.1-7.28.22);
Takai et al., Proc. Natl. Acad. Sci. USA 84:6864-6868, 1987; Bierer et al., J. Exp. Med.
168:1145-1156, 1988; Rosenstein et al., J. Exp. Med. 169:149-160, 1989; Stoltenborg et
al., J. Immunol. Methods 175:59-68, 1994; Stitt et al., Cell 80:661-670, 1995.



Anti-Inflammatory Activity

Proteins of the present invention may also exhibit anti-inflammatory activity. The
anti-inflammatory activity may be achieved by providing a stimulus to cells involved in the
inflammatory response, by inhibiting or promoting cell-cell interactions (such as, for example,
cell adhesion), by inhibiting or promoting chemotaxis of cells involved in the inflammatory
process, by inhibiting or promoting cell extravasation, or by stimulating or suppressing
production of other factors which more directly inhibit or promote an inflammatory response.
Proteins exhibiting such activities can be used to treat inflammatory conditions (including
chronic or acute conditions), including without limitation inflammation associated with
infection (such as septic shock, sepsis or systemic inflammatory response syndrome (SIRS)),
ischemia-reperfusion injury, endotoxin lethality, arthritis, complement-mediated hyperacute
rejection, nephritis, cytokine or chemokine-induced lung injury, inflammatory bowel disease,
Crohn's disease, or resulting from overproduction of cytokines such as TNF or IL-1. Proteins
of the invention may also be useful to treat anaphylaxis and hypersensitivity to an antigenic
substance or material.

Tumor Inhibition Activity

In addition to the activities described above for immunological treatment or prevention
of tumors, a protein of the invention may exhibit other anti-tumor activities. A protein may
inhibit tumor growth directly or indirectly (such as, for example, via ADCC). A protein may
exhibit its tumor inhibitory activity by acting on tumor tissue or tumor precursor tissue, by
inhibiting formation of tissues necessary to support tumor growth (such as, for example, by
inhibiting angiogenesis), by causing production of other factors, agents or cell types which
inhibit tumor growth, or by suppressing, eliminating or inhibiting factors, agents or cell types
which promote tumor growth.


Other Activities 

A protein of the invention may also exhibit one or more of the following additional
activities or effects: inhibiting the growth, infection or function of, or killing, infectious
agents, including, without limitation, bacteria, viruses, fungi and other parasites; affecting
(suppressing or enhancing) bodily characteristics, including, without limitation, height,
weight, hair color, eye color, skin, fat to lean ratio or other tissue pigmentation, or organ
or body part size or shape (such as, for example, breast augmentation or diminution, change in
bone form or shape); affecting biorhythms or circadian cycles or rhythms; affecting the
fertility of male or female subjects; affecting the metabolism, catabolism, anabolism,
processing, utilization, storage or elimination of dietary fat, lipid, protein, carbohydrate,
vitamins, minerals, cofactors or other nutritional factors or component(s); affecting
behavioral characteristics, including, without limitation, appetite, libido, stress, cognition
(including cognitive disorders), depression (including depressive disorders) and violent
behaviors; providing analgesic effects or other pain reducing effects; promoting
differentiation and growth of embryonic stem cells in lineages other than hematopoietic
lineages; hormonal or endocrine activity; in the case of enzymes, correcting deficiencies of
the enzyme and treating deficiency-related diseases; treatment of hyperproliferative disorders
(such as, for example, psoriasis); immunoglobulin-like activity (such as, for example, the
ability to bind antigens or complement); and the ability to act as an antigen in a vaccine
composition to raise an immune response against such protein or another material or entity
which is cross-reactive with such protein.






ADMINISTRATION AND DOSING

A protein of the present invention (from whatever source derived, including without
limitation from recombinant and non-recombinant sources) may be used in a pharmaceutical
composition when combined with a pharmaceutically acceptable carrier. Such a composition
may also contain (in addition to protein and a carrier) diluents, fillers, salts, buffers,
stabilizers, solubilizers, and other materials well known in the art. The term
"pharmaceutically acceptable" means a non-toxic material that does not interfere with the
effectiveness of the biological activity of the active ingredient(s). The characteristics of
the carrier will depend on the route of administration. The pharmaceutical composition of the
invention may also contain cytokines, lymphokines, or other hematopoietic factors such as
M-CSF, GM-CSF, TNF, IL-1, IL-2, IL-3, IL-4, IL-5, IL-6, IL-7, IL-8, IL-9, IL-10, IL-11, IL-12,
IL-13, IL-14, IL-15, IFN, TNF0, TNF1, TNF2, G-CSF, Meg-CSF, thrombopoietin, stem cell factor,
and erythropoietin. The pharmaceutical composition may further contain other agents which
either enhance the activity of the protein or complement its activity or use in treatment.
Such additional factors and/or agents may be included in the pharmaceutical composition to
produce a synergistic effect with protein of the invention, or to minimize side effects.
Conversely, protein of the present invention may be included in formulations of the particular
cytokine, lymphokine, other hematopoietic factor, thrombolytic or anti-thrombotic factor, or
anti-inflammatory agent to minimize side effects of the cytokine, lymphokine, other
hematopoietic factor, thrombolytic or anti-thrombotic factor, or anti-inflammatory agent.

A protein of the present invention may be active in multimers (e.g., heterodimers or
homodimers) or complexes with itself or other proteins. As a result, pharmaceutical
compositions of the invention may comprise a protein of the invention in such multimeric or
complexed form.

The pharmaceutical composition of the invention may be in the form of a complex of
the protein(s) of the present invention along with protein or peptide antigens. The protein
and/or peptide antigen will deliver a stimulatory signal to both B and T lymphocytes. B
lymphocytes will respond to antigen through their surface immunoglobulin receptor. T
lymphocytes will respond to antigen through the T cell receptor (TCR) following presentation of
the antigen by MHC proteins. MHC and structurally related proteins, including those encoded by
class I and class II MHC genes on host cells, will serve to present the peptide antigen(s) to
T lymphocytes. The antigen components could also be supplied as purified MHC-peptide complexes
alone or with co-stimulatory molecules that can directly signal T cells. Alternatively,
antibodies able to bind surface immunoglobulin and other molecules on B cells, as well as
antibodies able to bind the TCR and other molecules on T cells, can be combined with the
pharmaceutical composition of the invention.

The pharmaceutical composition of the invention may be in the form of a liposome in
which protein of the present invention is combined, in addition to other pharmaceutically
acceptable carriers, with amphipathic agents such as lipids which exist in aggregated form as
micelles, insoluble monolayers, liquid crystals, or lamellar layers in aqueous solution.
Suitable lipids for liposomal formulation include, without limitation, monoglycerides,
diglycerides, sulfatides, lysolecithin, phospholipids, saponin, bile acids, and the like.
Preparation of such liposomal formulations is within the level of skill in the art, as
disclosed, for example, in U.S. Patent No. 4,235,871; U.S. Patent No. 4,501,728; U.S. Patent
No. 4,837,028; and U.S. Patent No. 4,737,323, all of which are incorporated herein by reference.

As used herein, the term "therapeutically effective amount" means the total amount of
each active component of the pharmaceutical composition or method that is sufficient to show
a meaningful patient benefit, i.e., treatment, healing, prevention or amelioration of the
relevant medical condition, or an increase in rate of treatment, healing, prevention or
amelioration of such conditions. When applied to an individual active ingredient, administered
alone, the term refers to that ingredient alone. When applied to a combination, the term refers
to combined amounts of the active ingredients that result in the therapeutic effect, whether
administered in combination, serially or simultaneously.

In practicing the method of treatment or use of the present invention, a therapeutically
effective amount of protein of the present invention is administered to a mammal having a
condition to be treated. Protein of the present invention may be administered in accordance
with the method of the invention either alone or in combination with other therapies such as
treatments employing cytokines, lymphokines or other hematopoietic factors. When
co-administered with one or more cytokines, lymphokines or other hematopoietic factors, protein
of the present invention may be administered either simultaneously with the cytokine(s),
lymphokine(s), other hematopoietic factor(s), thrombolytic or anti-thrombotic factors, or
sequentially. If administered sequentially, the attending physician will decide on the
appropriate sequence of administering protein of the present invention in combination with
cytokine(s), lymphokine(s), other hematopoietic factor(s), thrombolytic or anti-thrombotic
factors.

Administration of protein of the present invention used in the pharmaceutical
composition or to practice the method of the present invention can be carried out in a variety
of conventional ways, such as oral ingestion, inhalation, topical application or cutaneous,
subcutaneous, intraperitoneal, parenteral or intravenous injection. Intravenous administration
to the patient is preferred.

When a therapeutically effective amount of protein of the present invention is
administered orally, protein of the present invention will be in the form of a tablet, capsule,
powder, solution or elixir. When administered in tablet form, the pharmaceutical composition
of the invention may additionally contain a solid carrier such as a gelatin or an adjuvant. The
tablet, capsule, and powder contain from about 5 to 95% protein of the present invention, and
preferably from about 25 to 90% protein of the present invention. When administered in liquid
form, a liquid carrier such as water, petroleum, oils of animal or plant origin such as peanut
oil, mineral oil, soybean oil, or sesame oil, or synthetic oils may be added. The liquid form
of the pharmaceutical composition may further contain physiological saline solution, dextrose
or other saccharide solution, or glycols such as ethylene glycol, propylene glycol or
polyethylene glycol. When administered in liquid form, the pharmaceutical composition contains
from about 0.5 to 90% by weight of protein of the present invention, and preferably from about
1 to 50% protein of the present invention.

When a therapeutically effective amount of protein of the present invention is
administered by intravenous, cutaneous or subcutaneous injection, protein of the present
invention will be in the form of a pyrogen-free, parenterally acceptable aqueous solution. The
preparation of such parenterally acceptable protein solutions, having due regard to pH,
isotonicity, stability, and the like, is within the skill in the art. A preferred pharmaceutical
composition for intravenous, cutaneous, or subcutaneous injection should contain, in addition
to protein of the present invention, an isotonic vehicle such as Sodium Chloride Injection,
Ringer's Injection, Dextrose Injection, Dextrose and Sodium Chloride Injection, Lactated
Ringer's Injection, or other vehicle as known in the art. The pharmaceutical composition of
the present invention may also contain stabilizers, preservatives, buffers, antioxidants, or
other additives known to those of skill in the art.

The amount of protein of the present invention in the pharmaceutical composition of
the present invention will depend upon the nature and severity of the condition being treated,
and on the nature of prior treatments which the patient has undergone. Ultimately, the
attending physician will decide the amount of protein of the present invention with which to
treat each individual patient. Initially, the attending physician will administer low doses of
protein of the present invention and observe the patient's response. Larger doses of protein of
the present invention may be administered until the optimal therapeutic effect is obtained for
the patient, and at that point the dosage is not increased further. It is contemplated that the
various pharmaceutical compositions used to practice the method of the present invention
should contain about 0.01 ng to about 100 mg (preferably about 0.1 μg to about 10 mg, more
preferably about 0.1 μg to about 1 mg) of protein of the present invention per kg body weight.
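
As a worked illustration of the per-kilogram arithmetic in the preceding paragraph, the following minimal Python sketch converts the stated preferred range (about 0.1 μg to about 10 mg of protein per kg body weight) into total amounts for a given body weight. The function name, the use of nanograms as an integer base unit, and the 70 kg example weight are assumptions made purely for illustration; as the text states, the attending physician decides the actual amount.

```python
# Illustrative unit arithmetic only; all names and the example weight are hypothetical.
NG_PER_UG = 1_000
NG_PER_MG = 1_000_000

# Preferred range from the text, expressed in ng per kg body weight
PREFERRED_LOW_NG_PER_KG = 100                 # about 0.1 ug/kg
PREFERRED_HIGH_NG_PER_KG = 10 * NG_PER_MG     # about 10 mg/kg


def total_range_ng(body_weight_kg: int) -> tuple[int, int]:
    """Return (low, high) total protein in ng for a whole-kilogram body weight."""
    if body_weight_kg <= 0:
        raise ValueError("body weight must be positive")
    return (PREFERRED_LOW_NG_PER_KG * body_weight_kg,
            PREFERRED_HIGH_NG_PER_KG * body_weight_kg)


low_ng, high_ng = total_range_ng(70)
print(low_ng // NG_PER_UG, "ug to", high_ng // NG_PER_MG, "mg")  # 7 ug to 700 mg
```

Integer nanograms are used so that the conversion avoids floating-point rounding; the sketch says nothing about what amount is appropriate for any patient.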

The duration of intravenous therapy using the pharmaceutical composition of the
present invention will vary, depending on the severity of the disease being treated and the
condition and potential idiosyncratic response of each individual patient. It is contemplated
that the duration of each application of the protein of the present invention will be in the
range of 12 to 24 hours of continuous intravenous administration. Ultimately the attending
physician will decide on the appropriate duration of intravenous therapy using the
pharmaceutical composition of the present invention.

Protein of the invention may also be used to immunize animals to obtain polyclonal
and monoclonal antibodies which specifically react with the protein. Such antibodies may be
obtained using either the entire protein or fragments thereof as an immunogen. The peptide
immunogens additionally may contain a cysteine residue at the carboxyl terminus, and are
conjugated to a hapten such as keyhole limpet hemocyanin (KLH). Methods for synthesizing
such peptides are known in the art, for example, as in R.P. Merrifield, J. Amer. Chem. Soc. 85,
2149-2154 (1963); J.L. Krstenansky, et al., FEBS Lett. 211, 10 (1987). Monoclonal
antibodies binding to the protein of the invention may be useful diagnostic agents for the
immunodetection of the protein. Neutralizing monoclonal antibodies binding to the protein
may also be useful therapeutics for both conditions associated with the protein and also in the
treatment of some forms of cancer where abnormal expression of the protein is involved. In
the case of cancerous cells or leukemic cells, neutralizing monoclonal antibodies against the
protein may be useful in detecting and preventing the metastatic spread of the cancerous cells,
which may be mediated by the protein.

For compositions of the present invention which are useful for bone, cartilage, tendon
or ligament regeneration, the therapeutic method includes administering the composition
topically, systemically, or locally as an implant or device. When administered, the
therapeutic composition for use in this invention is, of course, in a pyrogen-free,
physiologically acceptable form. Further, the composition may desirably be encapsulated or
injected in a viscous form for delivery to the site of bone, cartilage or tissue damage. Topical
administration may be suitable for wound healing and tissue repair. Therapeutically useful
agents other than a protein of the invention, which may also optionally be included in the
composition as described above, may alternatively or additionally be administered
simultaneously or sequentially with the composition in the methods of the invention.
Preferably for bone and/or cartilage formation, the composition would include a matrix capable
of delivering the protein-containing composition to the site of bone and/or cartilage damage,
providing a structure for the developing bone and cartilage and optimally capable of being
resorbed into the body. Such matrices may be formed of materials presently in use for other
implanted medical applications.

The choice of matrix material is based on biocompatibility, biodegradability,
mechanical properties, cosmetic appearance and interface properties. The particular
application of the compositions will define the appropriate formulation. Potential matrices for
the compositions may be biodegradable and chemically defined calcium sulfate,
tricalciumphosphate, hydroxyapatite, polylactic acid, polyglycolic acid and polyanhydrides.
Other potential materials are biodegradable and biologically well-defined, such as bone or
dermal collagen. Further matrices are comprised of pure proteins or extracellular matrix
components. Other potential matrices are nonbiodegradable and chemically defined, such as
sintered hydroxyapatite, bioglass, aluminates, or other ceramics. Matrices may be comprised
of combinations of any of the above mentioned types of material, such as polylactic acid and
hydroxyapatite or collagen and tricalciumphosphate. The bioceramics may be altered in
composition, such as in calcium-aluminate-phosphate, and processing to alter pore size,
particle size, particle shape, and biodegradability.

Presently preferred is a 50:50 (mole weight) copolymer of lactic acid and glycolic acid
in the form of porous particles having diameters ranging from 150 to 800 microns. In some
applications, it will be useful to utilize a sequestering agent, such as carboxymethyl cellulose
or autologous blood clot, to prevent the protein compositions from disassociating from the
matrix.

A preferred family of sequestering agents is cellulosic materials such as alkylcelluloses
(including hydroxyalkylcelluloses), including methylcellulose, ethylcellulose,
hydroxyethylcellulose, hydroxypropylcellulose, hydroxypropyl-methylcellulose, and
carboxymethylcellulose, the most preferred being cationic salts of carboxymethylcellulose
(CMC). Other preferred sequestering agents include hyaluronic acid, sodium alginate,
poly(ethylene glycol), polyoxyethylene oxide, carboxyvinyl polymer and poly(vinyl alcohol).
The amount of sequestering agent useful herein is 0.5-20 wt%, preferably 1-10 wt% based on
total formulation weight, which represents the amount necessary to prevent desorption of the
protein from the polymer matrix and to provide appropriate handling of the composition, yet
not so much that the progenitor cells are prevented from infiltrating the matrix, thereby
providing the protein the opportunity to assist the osteogenic activity of the progenitor cells.

In further compositions, proteins of the invention may be combined with other agents
beneficial to the treatment of the bone and/or cartilage defect, wound, or tissue in question.
These agents include various growth factors such as epidermal growth factor (EGF), platelet
derived growth factor (PDGF), transforming growth factors (TGF-α and TGF-β), and
insulin-like growth factor (IGF).

The therapeutic compositions are also presently valuable for veterinary applications.
Particularly domestic animals and thoroughbred horses, in addition to humans, are desired
patients for such treatment with proteins of the present invention.

The dosage regimen of a protein-containing pharmaceutical composition to be used
in tissue regeneration will be determined by the attending physician considering various factors
which modify the action of the proteins, e.g., amount of tissue weight desired to be formed, the
site of damage, the condition of the damaged tissue, the size of a wound, type of damaged
tissue (e.g., bone), the patient's age, sex, and diet, the severity of any infection, time of
administration and other clinical factors. The dosage may vary with the type of matrix used
in the reconstitution and with inclusion of other proteins in the pharmaceutical composition.
For example, the addition of other known growth factors, such as IGF I (insulin-like growth
factor I), to the final composition may also affect the dosage. Progress can be monitored by
periodic assessment of tissue/bone growth and/or repair, for example, X-rays,
histomorphometric determinations and tetracycline labeling.

Polynucleotides of the present invention can also be used for gene therapy. Such
polynucleotides can be introduced either in vivo or ex vivo into cells for expression in a
mammalian subject. Polynucleotides of the invention may also be administered by other
known methods for introduction of nucleic acid into a cell or organism (including, without
limitation, in the form of viral vectors or naked DNA).

Cells may also be cultured ex vivo in the presence of proteins of the present invention 
in order to proliferate or to produce a desired effect on or activity in such cells. Treated cells 
can then be introduced in vivo for therapeutic purposes. 


Patent and literature references cited herein are incorporated by reference as if fully 
set forth. 






WO 98/14470

PCT/US97/18032

SEQUENCE LISTING 



(1) GENERAL INFORMATION: 

(i) APPLICANT: Jacobs, Kenneth
McCoy, John
LaVallie, Edward
Racie, Lisa
Merberg, David
Treacy, Maurice
Spaulding, Vikki



(ii) TITLE OF INVENTION: SECRETED PROTEINS 
(iii) NUMBER OF SEQUENCES: 54 

(iv) CORRESPONDENCE ADDRESS: 

(A) ADDRESSEE: Genetics Institute, Inc.

(B) STREET: 87 Cambridge Park Drive 

(C) CITY: Cambridge 

(D) STATE: Massachusetts 

(E) COUNTRY: U.S.A. 

(F) ZIP: 02140 

(v) COMPUTER READABLE FORM: 

(A) MEDIUM TYPE: Floppy disk 

(B) COMPUTER: IBM PC compatible 

(C) OPERATING SYSTEM: PC-DOS/MS-DOS 

(D) SOFTWARE: PatentIn Release #1.0, Version #1.30 

(vi) CURRENT APPLICATION DATA: 

(A) APPLICATION NUMBER: 

(B) FILING DATE: 

(C) CLASSIFICATION: 

(viii) ATTORNEY/AGENT INFORMATION:

(A) NAME: Brown, Scott A. 

(B) REGISTRATION NUMBER: 32,724 

(ix) TELECOMMUNICATION INFORMATION: 

(A) TELEPHONE: (617) 498-8224 

(B) TELEFAX: (617) 876-5851 



(2) INFORMATION FOR SEQ ID NO: 1:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 276 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: cDNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 1:

AAGCTTGGGG TTTTCTGGGC TACTACGATG GCGATGAGTT TCGAGTGGCC GTGGCAGTAC 60 

CGCTTCCCGC CCTTCTTTAC GTTACAGCCG AACGTGGACA CCCGGCAGAA GCAGCTGGCC 120 

GCCTGGTGCT CTCTGGTTCT GTCCTTCTGC CGCCTGCACA AACAGTCCAG CATGACGGTG 180 

ATGGAAGCCC AGGAGAGCCC GCTTTTCAAC AACGTCAAGC TACAGCGGAA ACTTCCTGTG 240

GAGTCAATTC AGATTGTATT AGAAGAACTG AGAAAG 276 
(2) INFORMATION FOR SEQ ID NO: 2: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 83 amino acids 

(B) TYPE: amino acid 

(C) STRANDEDNESS:

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: protein 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 2:

Met Ala Met Ser Phe Glu Trp Pro Trp Gln Tyr Arg Phe Pro Pro Phe
1               5                   10                  15

Phe Thr Leu Gln Pro Asn Val Asp Thr Arg Gln Lys Gln Leu Ala Ala
            20                  25                  30

Trp Cys Ser Leu Val Leu Ser Phe Cys Arg Leu His Lys Gln Ser Ser
        35                  40                  45

Met Thr Val Met Glu Ala Gln Glu Ser Pro Leu Phe Asn Asn Val Lys
    50                  55                  60

Leu Gln Arg Lys Leu Pro Val Glu Ser Ile Gln Ile Val Leu Glu Glu
65                  70                  75                  80

Leu Arg Lys
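
The relationship between SEQ ID NO: 1 and SEQ ID NO: 2 can be checked with a short script. The sketch below is not part of the original listing: it translates the 276-base cDNA of SEQ ID NO: 1 with the standard genetic code, starting at the first ATG (nucleotide 28), and compares the result with the 83-residue sequence of SEQ ID NO: 2 written in one-letter code. The helper name `translate` and the choice of reading frame are illustrative assumptions.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# SEQ ID NO: 1 as printed in the listing (spaces are removed below)
SEQ_ID_NO_1 = (
    "AAGCTTGGGG TTTTCTGGGC TACTACGATG GCGATGAGTT TCGAGTGGCC GTGGCAGTAC"
    "CGCTTCCCGC CCTTCTTTAC GTTACAGCCG AACGTGGACA CCCGGCAGAA GCAGCTGGCC"
    "GCCTGGTGCT CTCTGGTTCT GTCCTTCTGC CGCCTGCACA AACAGTCCAG CATGACGGTG"
    "ATGGAAGCCC AGGAGAGCCC GCTTTTCAAC AACGTCAAGC TACAGCGGAA ACTTCCTGTG"
    "GAGTCAATTC AGATTGTATT AGAAGAACTG AGAAAG"
).replace(" ", "")

# SEQ ID NO: 2 in one-letter code
SEQ_ID_NO_2 = (
    "MAMSFEWPWQYRFPPFFTLQPNVDTRQKQLAAWCSLVLSFCRLHKQSSMTV"
    "MEAQESPLFNNVKLQRKLPVESIQIVLEELRK"
)


def translate(dna: str, start: int = 0) -> str:
    """Translate dna from the 0-based offset start until a stop codon or the end."""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


start = SEQ_ID_NO_1.find("ATG")  # 0-based index of the first ATG
print(len(SEQ_ID_NO_1), start + 1, translate(SEQ_ID_NO_1, start) == SEQ_ID_NO_2)
```

Under these assumptions the open reading frame runs from nucleotide 28 to the end of the listed sequence without an in-frame stop codon.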



(2) INFORMATION FOR SEQ ID NO: 3: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 246 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: cDNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 3:

GTGAGTACAT ACACACANGC GCNTGCAGCA CANGATTNTG TCTCATCGTC NTCCCACCCN 60

NNNNGGNGNN GNTGCCTCCC TTAGTCAGGN GANGATGNAT CCTTTCCNAG GGGNTGGGGG 120

GNANCATTGG ATGCGGGCAG CNTTCCAGGC AANATGAAGA TNGGAGGCCC ACGGGCATGG 180

CAGTGAGAGG NGTGGCCCCA CACNGATTTA TGATNTTGAA ATCTCAACTC CCAAAAAAGA 240

AAAAAA 246

(2) INFORMATION FOR SEQ ID NO: 4:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 632 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: double

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 4: 

AGCTTCGGAA TAATAATTTT GGCAAATCTA TCTTCTGAAC CACTCATTTC TGTGGTCTTA 60 

ATGGCTCCAA TTTGGGGACC AATAATGTTC ATTGTCTCAG GATCCCTGTC AATTGCAGCA 120 

GGAGTCAAAC CTACAAAAAG CCTGATCATC AGCAGTCTAA CTCTGAACAC TATCACCTCT 180 

GTGTTGGCTG CAACTGCAAG CATAATGGGT GTAGTCAGTG TGGCTGTGGG TTCACAGTTT 240 

CCGTTTCGGT ATAATTATAC AATCACCAAG GGTTTGGATA TTTTGATGTT AATTTTAAAT 300 

ATGCTAGAAT TCTGCATTGC TGTGTCCATC TCTGCTTTTG GATGTAAAGC TTCCTGTTGT 360 

AACTCCAGCG AGGTTCTTGT AGTGCTACCA TCAAATCCTG CTGTGACTGT GATGGCACCC 420 

CCCACACCAC TTAATGAAGG TTTGAGGCCA CCAAAAGATC AACAGACAAA TGCTCCAGAA 480 

ATCTATGCTG ACTGTGACAC AAGAAGCCTC ACATGAAGAA ATTACCAGTA TCCAACTTCG 540 

ATACTGATAG ACTTGTTGAT ATTATTATTA TATGTAATCC AATTATGAAC TGTGTGTGTA 600 
TAGAGAGATA ATAAATTCAA AATTATGTTC TC 632

(2) INFORMATION FOR SEQ ID NO: 5:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 151 amino acids 

(B) TYPE: amino acid 

(C) STRANDEDNESS:

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: protein 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 5: 

Met Ala Pro Ile Trp Gly Pro Ile Met Phe Ile Val Ser Gly Ser Leu
1               5                   10                  15

Ser Ile Ala Ala Gly Val Lys Pro Thr Lys Ser Leu Ile Ile Ser Ser
            20                  25                  30

Leu Thr Leu Asn Thr Ile Thr Ser Val Leu Ala Ala Thr Ala Ser Ile
        35                  40                  45

Met Gly Val Val Ser Val Ala Val Gly Ser Gln Phe Pro Phe Arg Tyr
    50                  55                  60

Asn Tyr Thr Ile Thr Lys Gly Leu Asp Ile Leu Met Leu Ile Leu Asn
65                  70                  75                  80

Met Leu Glu Phe Cys Ile Ala Val Ser Ile Ser Ala Phe Gly Cys Lys
                85                  90                  95

Ala Ser Cys Cys Asn Ser Ser Glu Val Leu Val Val Leu Pro Ser Asn
            100                 105                 110

Pro Ala Val Thr Val Met Ala Pro Pro Thr Pro Leu Asn Glu Gly Leu
        115                 120                 125

Arg Pro Pro Lys Asp Gln Gln Thr Asn Ala Pro Glu Ile Tyr Ala Asp
    130                 135                 140

Cys Asp Thr Arg Ser Leu Thr
145                 150

(2) INFORMATION FOR SEQ ID NO: 6:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 365 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 6:

CTATGGGGAC CAAAGTGNTT TTTCNTTCAG GAAGTGGAGA TGCATGGCCA TCTCCCCCTC 60

CCTTTTTCCT TCTCNTGNTT TTCTTTCCCC ATAGAAAGTA CCTTGAAGTA GCACAGTCCG 120

TCCTTGCATG TGCNCGNGCT NTCNTTTGAG TAAAAGTATA CATGGAGTAA AAATCATATT 180

AAGCATCAGA TTCAACTTAT ATTTTNTATT TCATNTTCTT CCTTTCCCTT CTCCCACNTT 240

NTACTGGGCA TAATTATATN TTAATCATAT ATGGAAATGT GCAACATATG GTATTTGTTA 300

AATACGTTTG TTTTTATTGC AGAGCAAAAA TAAATCAAAT TAGAAGCAAA AAAAAAAAAA 360

AAAAA 365

(2) INFORMATION FOR SEQ ID NO: 7:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 689 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 7:

CCCANAGAGN CCTAGGAAGA TGAACAAACG ACAGCTCTAC TACCAGGTTT TAAACTTTGC 60 

CATGATCGTG TCTTCTGCGC TCATGATCTG GAAAGGCCTG ATTGTTCTCA CGGGCAGCGA 120 

GAGTCCCATC GTGGWGGTAC TCAGTGGCAG TATGGAGCCG GCCTTCCACA GAGGAGATCT 180 

BCTGTTCCTC ACGAATTTCC GGGAGGACCC CATCAGAGCT GGTGAAATAG TTGTTTTTAA 240 

GGTTGAAGGA AGAGACATTC CGATAGTTCA CAGAGTAATC AAGGTTCATG AAAAAGATAA 300

TGGTGACATC AARTTTCTGA CTAAAGGAGA TAATAATGAA GTYGATGATA GAGGCTTGTA 360 

CAAAGAAGGC CAGAACTGGC TGGAAAAGAA GGACGTGGTG GGAAGAGCAA GANGGTTTTT 420 

ACCATATGTT GGTATGGTCA CCATAATAAT GAATGACTAT CCAAAATTCA AKTATGCTCT 480 

TTTGGCTGTA ATGGGTGCAT ATGTGTTACT AAAACGTGAA TCCTAAAATG AGAAGCAGTT 540 

CCTGGGACCA GATTGAAATG AATTCTGTTG AAAAAGAGAA AAACTAATAT ATTTGAGATG 600 

TTCCATTTTC TGTATAAAAG GGAACAGTGT GGAGATGTTT TTGTCTTGTC CAAATAAAAG 660 

ATTCACCAGT AAAAAAAAAA AAAAAAAAA 689

(2) INFORMATION FOR SEQ ID NO: 8:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 168 amino acids 

(B) TYPE: amino acid 

(C) STRANDEDNESS: 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: protein 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 8:

Met Asn Lys Arg Gln Leu Tyr Tyr Gln Val Leu Asn Phe Ala Met Ile
1               5                   10                  15

Val Ser Ser Ala Leu Met Ile Trp Lys Gly Leu Ile Val Leu Thr Gly
            20                  25                  30

Ser Glu Ser Pro Ile Val Xaa Val Leu Ser Gly Ser Met Glu Pro Ala
        35                  40                  45

Phe His Arg Gly Asp Leu Leu Phe Leu Thr Asn Phe Arg Glu Asp Pro
    50                  55                  60

Ile Arg Ala Gly Glu Ile Val Val Phe Lys Val Glu Gly Arg Asp Ile
65                  70                  75                  80

Pro Ile Val His Arg Val Ile Lys Val His Glu Lys Asp Asn Gly Asp
                85                  90                  95

Ile Lys Phe Leu Thr Lys Gly Asp Asn Asn Glu Val Asp Asp Arg Gly
            100                 105                 110

Leu Tyr Lys Glu Gly Gln Asn Trp Leu Glu Lys Lys Asp Val Val Gly
        115                 120                 125

Arg Ala Arg Xaa Phe Leu Pro Tyr Val Gly Met Val Thr Ile Ile Met
    130                 135                 140

Asn Asp Tyr Pro Lys Phe Xaa Tyr Ala Leu Leu Ala Val Met Gly Ala
145                 150                 155                 160

Tyr Val Leu Leu Lys Arg Glu Ser
                165
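
SEQ ID NO: 7 contains IUPAC ambiguity codes (such as N, W, B, R, Y and K), and SEQ ID NO: 8 shows Xaa at some of the corresponding positions. One plausible reading, sketched below as an assumption rather than as the applicants' stated procedure, is that a codon is reported as a specific residue when every expansion of its ambiguity codes encodes the same amino acid, and as Xaa (here "X") otherwise. The function and table names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide codes and the bases each one stands for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon(codon: str) -> str:
    """Return the one-letter residue if all expansions agree, else 'X' (Xaa)."""
    expansions = product(*(IUPAC[base] for base in codon))
    residues = {CODON_TABLE["".join(c)] for c in expansions}
    return residues.pop() if len(residues) == 1 else "X"


# Ambiguous codons taken from the open reading frame of SEQ ID NO: 7
for codon in ("GWG", "CTB", "AAR", "GTY", "NGG", "AAK"):
    print(codon, translate_codon(codon))
```

Under this reading, CTB, AAR and GTY resolve to Leu, Lys and Val despite the ambiguity, while GWG, NGG and AAK cannot be resolved, which is consistent with the Leu, Lys, Val and Xaa residues listed at those positions in SEQ ID NO: 8.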

(2) INFORMATION FOR SEQ ID NO: 9: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 309 base pairs

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear